Abstract

In this chapter an analysis of the behavior of an arbitrary (perhaps 
massive) collective of computational processes in terms of an associated 
“world” utility function is presented. We concentrate on the situation
where each process in the collective can be viewed as though it were 
striving to maximize its own private utility function. For such situations 
the central design issue is how to initialize/update the collective’s struc- 
ture, and in particular the private utility functions, so as to induce the 
overall collective to behave in a way that has large values of the world 
utility. Traditional “team game” approaches to this problem simply set
each private utility function equal to the world utility function. The
“Collective Intelligence” (COIN) framework is a semi-formal set of heuristics
that recently have been used to construct private utility functions that 
in many experiments have resulted in world utility values up to orders 
of magnitude superior to that ensuing from use of the team game utility. 
In this paper we introduce a formal mathematics for analysing and de- 
signing collectives. We also use this mathematics to suggest new private 
utilities that should outperform the COIN heuristics in certain kinds of 
domains. In accompanying work we use that mathematics to explain pre- 
vious experimental results concerning the superiority of COIN heuristics. 
In that accompanying work we also use the mathematics to make numer- 
ical predictions, some of which we then test. In this way these two papers 
establish the study of collectives as a proper science, involving theory, 
explanation of old experiments, prediction concerning new experiments, 
and engineering insights. 


Introduction 

This paper concerns distributed systems some of whose components can be 
viewed as though they were agents, adaptively “trying” to induce large values
of their associated private utility functions. When combined with a world utility 
function that rates the possible behaviors of that system, the system is known 
as a collective [17, 20, 23, 25]. 
Given a collective, there is an associated inverse design problem, of how to
configure/modify the system so that in their pursuit of their private utilities
the agents also maximize the world utility. Solving this problem may involve
determining/modifying the number of agents, how they interact with each other,
and what degrees of freedom of the overall system each of them controls (i.e., the
very definition of the agents). When the agents are machine learning algorithms 
overtly trying to maximize their private utilities, the inverse problem may also 
involve determining/modifying the algorithms that those agents use, as well as 
precisely what private utilities they are each trying to maximize. 

This paper presents a mathematical framework for the investigation of col- 
lectives, and in particular the investigation of this design problem. A crucial 
feature of this framework is that it involves no modeling of the underlying sys- 
tem nor of the algorithms controlling the agents. For example, only the behavior 
of an agent (or more precisely, certain broad aspects of it) is formally related to 
what private utility that agent is “trying” to maximize; nothing of what goes on 
“under the hood” is assumed. This behaviorist approach is crucial since in the 
real world collectives are often so complicated that no tractable model can bear 
more than a cursory similarity with the system it is supposed to represent. More 
generally, this approach is crucial to have the framework be broad enough to 
encompass, for example, the collectives of spin glasses and of human economies. 

In the next section we introduce generalized coordinates. These allow us to 
avoid any restrictions on the kinds of variables comprising the system — they can 
be uncountable, countable, or combinations thereof, with or without an under- 
lying topology/metric, and except where explicitly indicated otherwise, all the 
results of the framework still apply. The underlying variables can either include 
time or not, and if they do, the associated underlying dynamics is arbitrary. The 
variables also can either be broken up explicitly into separate agents or not, and
if they are, there can be arbitrary restrictions on which of the conceivable joint 
moves of the agents are physically allowed. In addition, how the variables are 
broken up into agents, and even the number of agents is arbitrary, and can be 
modified dynamically (if time is included in the underlying variables). More- 
over, if time is included as an underlying variable, then some of the agents can 
have their decision “simultaneously” fix the state of one or more variables of 
the system at distinct moments in time. (This is reminiscent of what is decided
in settling on a contract in cooperative game theory.) Again, all of this can be
varied in an arbitrary fashion.

Using these generalized coordinates, a central equation can be derived that 
determines how well any of these kinds of systems perform. It does so by breaking 
performance down into three terms. These terms loosely reflect the concerns 
of the fields of high-dimensional search, economics, and machine learning; the 
central equation is the bridge that couples those fields. 

The following section uses this mathematical framework to introduce a (model- 
independent) formalization of the assumption that a particular component of 
the system is a “utility-maximizing agent”. That formalization is then used
to derive the Aristocrat and Wonderful Life private utility functions, two utility 
functions previously intuited that have been found to result in far better world 
utility than conventional techniques [17]. This derivation also uncovers
(relatively rare) conditions under which those utilities should not perform very well.
That section ends by deriving many new results, including the Collapsed private 
Utility, and ways to modify other agents to help a particular agent, along with 
specification of the scenarios in which such techniques should result in good 
world utility. 

An accompanying paper [22] presents this mathematical framework in a more 
pedagogical manner, including many examples, commentary and some discus- 
sion of related fields (e.g., mechanism design in game theory). That paper also 
discusses recent experiments involving a set of previous semi-formal heuristics 
(including the Aristocrat and Wonderful Life private utilities) that have been 
found to be very useful for the design of collectives. It uses the mathematical 
framework to explain the efficacy of those techniques. It then goes on to make 
numerical predictions based on that framework, and then presents some experi- 
mental tests of those predictions. It ends by making other (testable) predictions, 
and presents a sample of future research topics and open issues. 

This paper instead exhaustively presents all of the currently elaborated 
mathematics of the framework, including the details omitted in [22]. In particular,
this paper contains theorems not presented there, extensions of the theorems
that are presented there, the proofs of all theorems, detailed application of the 
framework to multi-step games, and the important example of applying the 
framework to gradient ascent over categorical variables. (For pedagogical rea- 
sons, the latter two occur as appendices.) Combined, these two papers present 
a mathematical theory along with associated predictions/experiments and en- 
gineering recommendations. In this, they lay the foundation for a full-fledged 
science of collectives. 

1 The Central Equation 

(i) Generalized coordinates and intelligence 

We are interested in addressing optimization problems by decomposing them
into many subproblems, each of which is solved separately. We will not try
to choose such subproblems so that they are independent of one another, or
find a way to coordinate their solutions. Rather we will choose the subproblems 
so that each of them separately is relatively easy to solve, given the context of 
a particular current solution to the other subproblems, and then have them be 
solved in parallel. 

To formalize this, let $\zeta$ be an arbitrary space with elements $z$ called
worldpoints. Let $C \subseteq \zeta$ be the set of elements of $\zeta$ that are
actually allowed, for example in that they are consistent with the laws of physics.¹
Define a generalized coordinate variable as a function from $C$ to associated
coordinate values. (When the context makes the precise meaning clear, we will
sometimes use the term “coordinate” to refer to a generalized coordinate variable,
and sometimes to a value of that variable.) We will sometimes view a coordinate
variable $p$ as an exhaustive partition of $C$ into non-empty subsets, with $p(z)$
being the element of the partition that contains $z$. Accordingly we will sometimes
write a coordinate value $r = p(z)$ as “$r \in p$” and a worldpoint $z'$ sharing that
value as “$z' \in r$”.² Intuitively, each “sub-problem” of our overall optimization
problem will be formalized in terms of such a partition $p$, as finding the optimal
$z$ within the $r \in p$ specified by the current solutions to the other subproblems.

1 Whenever expressing a particular system as a collective, it is a good rule to write out the
functional dependencies presumed to specify $C(\cdot)$ as explicitly as one can, to check that what
one has identified as the space $\zeta$ does indeed contain all the important variables.

Often we implicitly assume that the set of values that any coordinate variable
we are discussing can take on forms a measurable set, as does the set of
worldpoints having any such value. (All integrals are implicitly with respect to 
such measures.) 

As an example, C might consist of the possible joint actions of a set of 
computational agents engaged in a non-cooperative game [7, 2, 10, 3, 5].
$p(z \in C)$ could then be the actions of all agents except some particular agent identified
with p. In this case, by fixing all other degrees of freedom, the value of the 
coordinate p implicitly specifies the degrees of freedom that are still “available 
to be set” by the agent identified with p. 
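
The partition view of a coordinate can be made concrete for a small finite game. The sketch below is illustrative only: the action sets, the choice of three agents, and the helper names (`ACTIONS`, `context_of`, `contexts`) are assumptions introduced here, not part of the framework. It takes C to be the full set of joint actions and p(z) to be the actions of all agents except agent 0, so that each value r of p is the set of worldpoints that differ only in agent 0's action.

```python
from itertools import product

# Hypothetical action sets for three agents (illustrative assumption).
ACTIONS = [("a", "b"), ("x", "y", "w"), (0, 1)]

# C: in this toy example every joint action is allowed.
C = list(product(*ACTIONS))


def context_of(z, agent=0):
    """Coordinate p(z): the actions of all agents except `agent`."""
    return tuple(a for i, a in enumerate(z) if i != agent)


def contexts(worldpoints, agent=0):
    """Group worldpoints into the partition elements r induced by p."""
    partition = {}
    for z in worldpoints:
        partition.setdefault(context_of(z, agent), []).append(z)
    return partition


partition = contexts(C)
```

Each key of `partition` is a coordinate value r, and the list stored under it holds the degrees of freedom that agent 0 can still set once r is fixed.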

A frequently occurring type of coordinate variable is one whose values are 
contained in the real numbers. A particularly important example is a world 
utility function $G : C \to \mathbb{R}$ that ranks the various possible worldpoints of
the system. We are always provided a $G$; the goal in the problem of designing
collectives is to maximize G. 

Our mathematics does not concern $G$ alone, but rather its relationship with
some coordinate utilities $g_p : C \to \mathbb{R}$.³ Each coordinate utility ranks the
possible values of those degrees of freedom still allowed once the worldpoint has
been restricted to a set of worldpoints $r \in p$. Given a set of coordinate variables
$\{p\}$, we are interested in inducing a $z$ that each $g_p$ ranks highly (relative to the
other worldpoints in the associated set $r = p(z)$), and in the relation between
those rankings of $z$ and $G$’s ranking of $z$. To analyze these issues we need to
standardize utility functions so that the numeric value they assign to $z$ only
reflects their relative ranking of $z$ (potentially just in comparison to the other
worldpoints sharing some associated coordinate value).⁴

Generically, we indicate such a standardization by $N$, and for any utility
function $U$, coordinate $p$, and $z \in C$, we write the associated value of such a
standardization of the utility $U$ as $N_{p,U}(z)$. Define “sgn[x]” to equal +1, 0, or
-1 in the usual way. Then we only need to require of a standardization $N$ that
$N_{p,U}(z)$ be a [0, 1]-valued, $p$-parameterized functional of the pair $(U, U(z))$, one
that meets the following two conditions as we vary $U$ and/or $z$:

2 In general, we try to use lower-case greek letters for coordinates, and the associated lower- 
case roman letter for the value of that coordinate. 

3 In previous work, roughly analogous utilities were called “personal utilities” [17]. 

4 It turns out that there never arises a reason to consider the relation between such a stan- 
dardization and the axioms conventionally used to derive utility theory [10], and in particular 
those axioms concerning behavior of expectation values of utility. 
i) $\forall z \in C$, if for a pair of utilities $V$ and $W$,
$\mathrm{sgn}[W(z') - W(z)] = \mathrm{sgn}[V(z') - V(z)]$ $\forall z' \in p(z)$,
then $N_{p,W}(z) = N_{p,V}(z)$.

ii) With $U$ and $r \in p$ fixed, $\forall z, z' \in r$,
$\mathrm{sgn}[N_{p,U}(z) - N_{p,U}(z')] = \mathrm{sgn}[U(z) - U(z')]$.

We call the value of $N_{p,U}$ at $z$ the “intelligence of $z$ (given $p$) with respect
to $U$ for coordinate $p$”.⁵,⁶ If $p$ consists of a single set (all of $C$), we simply write
$N_U(z)$. An example of an intelligence operator based on percentiles is provided
in App. A. Unless explicitly stated otherwise, whenever calculating intelligence
values in any examples, we will use this choice of the intelligence operator.
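
A percentile-style intelligence is easy to sketch on a finite context. App. A is not reproduced in this section, so the form used below is an assumption: the measure over the context is taken to be uniform, and the intelligence of z is the fraction of worldpoints in its context whose utility does not exceed U(z). The function and variable names are hypothetical. The sketch then checks conditions (i) and (ii) directly on a toy context.

```python
def percentile_intelligence(U, z, context):
    """Fraction of worldpoints z2 in `context` with U(z2) <= U(z).

    Assumes a uniform measure over a finite context; the measure used in
    App. A may differ.
    """
    return sum(1 for z2 in context if U(z2) <= U(z)) / len(context)


def sgn(x):
    return (x > 0) - (x < 0)


# Toy context: five worldpoints labelled by integers (illustrative only).
context = [0, 1, 2, 3, 4]


def V(z):
    return (z - 2) ** 2


def W(z):
    # Orders the context exactly as V does, with different numeric values.
    return 10 * V(z) + 3


# Condition (i): utilities that order the context identically
# receive identical intelligences.
cond_i = all(
    percentile_intelligence(V, z, context) == percentile_intelligence(W, z, context)
    for z in context
)

# Condition (ii): within a context, intelligence orders points as U does.
cond_ii = all(
    sgn(percentile_intelligence(V, a, context) - percentile_intelligence(V, b, context))
    == sgn(V(a) - V(b))
    for a in context
    for b in context
)
```

With this particular convention the intelligence never reaches 0 on a finite context (its minimum is 1/n), which is one reason the choice of measure and tie-breaking matters in practice.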

Often there will be uncertainty in the worldpoint $z$, in particular on the
part of the system designer (e.g., when worldpoints are worldlines of a physical
system, such uncertainty arises if the designer is not able to calculate exactly
how the system evolves). Such uncertainty is captured by a distribution $P(z)$
that equals 0 off of $C$.⁷ Accordingly, coordinates $p$ are not only partitions, but
are also random variables, taking values $r \in p$.

All aspects of the designer’s ability to manipulate the system are encapsulated
in the selection of an element $s$ from some design coordinate $\sigma$. In
particular, since the (sub)problem of finding a $z \in r$ with maximal $p$-intelligence
will vary as $r$ varies, it cannot be addressed with conventional algorithms for
maximizing a static function. Instead, its solution requires techniques (like
those in reinforcement learning) tailored for dynamically varying and/or
uncertain functions. Accordingly, we will often consider the case where (among
other things) $s$ specifies which of a set of allowed private utility functions to
associate with some coordinate $p$, $g_{p,s} : z \to \mathbb{R}$. Such a function is one that
we view intuitively as the “payoff function” for a self-interested computational
5 Note that for fixed $U$, the function $z \mapsto N_{p,U}(z)$ on $C$ can be viewed as a utility
function, and therefore as a coordinate. In particular, $N_{p,N_{p,U}} = N_{p,U}$. This follows from
condition (i) in the definition of intelligence with $V = U$, $W = N_{p,V}$, and the equality of sgn’s
following from condition (ii) in the definition of intelligence.

6 Although this paper concentrates on $\mathbb{R}$-valued utility functions, much of its analysis can
be extended to functions having different ranges. Examples include vector-valued functions
having range $\mathbb{R}^n$ (appropriate for analyzing intelligence with respect to several distinct $U$
at once) and functions whose range is a set of non-overlapping contiguous sub-intervals
of $\mathbb{R}$. In particular, given some such range $Q$, and any associated antisymmetric preference
function $F : Q \times Q \to \{-1, 0, 1\}$, we can replace the sgn function with $F$ throughout (i) and
(ii) when we specify our intelligence operator. Much of the sequel (e.g., Thm. 1) still holds
under this modification. If in addition $Q$ is a field over the reals, we can also form the average
value of such an intelligence, and some of the theorems presented below concerning expected
intelligence values will go through.

7 If there is uncertainty in $C$ itself we express that with a distribution $P(C)$, to go with the
distributions P(z | C). In particular, if probabilities reflect the system designer’s uncertainty 
about C, then P(z) may be non-zero even for points z off of the actual C. Fixing C exactly 
is analogous to fixing the energy exactly in statistical physics (the microcanonical ensemble), 
with allowing C to vary being analogous to uncertainty in the energy (the canonical ensemble). 
Unless explicitly stated otherwise, in this paper we will consider C to be fixed. In a similar 
fashion, if probabilities reflect uncertainty in how a coordinate $\kappa$ partitions $C$, then it could
be that $P(z \mid k)$ is non-zero even for points $z$ where $\kappa(z) \neq k$. (For simplicity, we will usually
assume this is not the case.) 
agent, embodied in $C$, that uses a “learning algorithm” to “control” position
within any particular element of $p$.⁸ A priori, a coordinate need not have an
associated private utility; in particular, non-learning agents need not. Informally,
when we have a “learning agent” associated with coordinate $p$ we refer to $p$
as either the agent coordinate or the agent’s context coordinate, with the
value of that coordinate being the agent’s context. (These definitions are made
more formal below.)

Properly interpreted, the rules of set theory hold when coordinate variables
play the role of sets. Under this interpretation any coordinate variable $\kappa$ arising
in a set-theoretic expression should be read as “every (subset of $\zeta$ that
constitutes an) element of $\kappa$”. For example, $\kappa \subset \lambda$ means “every element of $\kappa$ is a
proper subset of every element of $\lambda$”, so that the value $k$ fixes $l$. See App. B.

As a notational matter, we adopt the usual convention that the probability of
a coordinate value is shorthand for the probability that the associated random
variable takes on that value, e.g., $P(a)$ means $P(\alpha = a)$. As usual though, this
convention is not propagated to expectation values:
$E(U(a, \beta) \mid c) = \int db \, U(a, b) P(b \mid c)$. Delta
functions are either Kronecker or Dirac as appropriate (although always written
as arguments rather than as subscripts). Similarly, integrals are assumed to have
a point-mass measure (i.e., reduce to a sum) as appropriate. For any function
$\phi : C \to \mathbb{R}$ and coordinate $\kappa$, with $y \in [0, 1]$, we write $\mathrm{CDF}_\phi(y \mid k)$ to mean the
cumulative distribution function
$P(\phi \le y \mid k) = \int_{-\infty}^{y} dt \int dz \, P(z \mid k) \, \delta(\phi(z) - t)$,
and just write $\mathrm{CDF}(\phi \mid k)$ to refer to the entire function over $y$. In addition,
“supp” is shorthand for the support operator, and “$\mathbb{B}$” indicates the Booleans.
$O(A)$ means the cardinality of the set $A$. For any two functions $f_1$ and $f_2$ with
the same domain $x \in X$, “$f_1 < f_2$” means that $\forall x$, $f_1(x) \le f_2(x)$, and $\exists x$ such
that $f_1(x) < f_2(x)$. All proofs that are not in the text are provided in App. C.

(ii) The Central Equation 

Our analysis revolves around the following central equation for $P(U \mid s)$,
which follows from applying Bayes’ theorem twice in succession:

$$P(U \mid s) = \int dN_U \, P(U \mid N_U, s) \int dN_g \, P(N_U \mid N_g, s) \, P(N_g \mid s) \qquad (1)$$

where usually we are interested in having $U = G$. “$g$” is the vector of the values
of a set of coordinate utilities, and “$N_g$” is an associated vector of intelligences
with respect to those coordinate utilities. Here we concentrate on the case where
each of those intelligences is for the associated coordinate, i.e., for a set of
coordinates $\{p\}$ it is the $p$-indexed vector with components $\{N_{p,g_p}(z)\}$. “$N_U$” is also a
coordinate-variable-indexed vector of intelligence values, only for utility $U$. We
will concentrate on the case where $N_U$ is indexed with the same coordinates as
$N_g$. In this situation $N_U$ has components $N_{p,U}(z)$ and is identical to $N_g$ except
in its choice of utility functions.⁹ Reading the integrand of Eq. 1 from left to
right, we refer to $P(U \mid N_U, s)$ as term 1, $P(N_U \mid N_g, s)$ as term 2, and
$P(N_g \mid s)$ as term 3.

8 Note that, formally speaking, the learning algorithm itself is embodied in $C$. Hence the
quotation marks around the term ‘control’.

If we can choose $s$ so that term 3 in the integrand in Eq. 1 is peaked around
vectors $N_g$ all of whose components are close to 1, then we have likely induced
large intelligences. If in addition to such a good term 3 we can have term 2 be
peaked about $N_U$ equal to $N_g$, then $N_U$ will also be large. If in addition term 1
in the integrand is peaked about high $U$ when $N_U$ is large, then our choice of $s$
will likely result in high $U$, as desired.
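
The decomposition in Eq. 1 can be checked numerically on a discretized toy problem. The sketch below is only an illustration under stated assumptions: $N_U$ and $N_g$ are treated as scalars taking finitely many values (in the text they are vectors), the joint distribution for a fixed design $s$ is drawn at random, and every name in the code is hypothetical. The three conditional distributions play the roles of $P(U \mid N_U, s)$, $P(N_U \mid N_g, s)$, and $P(N_g \mid s)$, and the integrals become sums.

```python
import random

random.seed(0)
n_U, n_NU, n_Ng = 3, 4, 5  # sizes of the discretized value sets (assumption)

# Random joint distribution P(U, N_U, N_g | s) for one fixed design s.
weights = [[[random.random() for _ in range(n_Ng)]
            for _ in range(n_NU)] for _ in range(n_U)]
total = sum(w for plane in weights for row in plane for w in row)
joint = [[[w / total for w in row] for row in plane] for plane in weights]


def p_Ng(g):
    """P(N_g | s)."""
    return sum(joint[u][n][g] for u in range(n_U) for n in range(n_NU))


def p_NU_given_Ng(n, g):
    """P(N_U | N_g, s)."""
    return sum(joint[u][n][g] for u in range(n_U)) / p_Ng(g)


def p_NU(n):
    """P(N_U | s), needed to normalize the next conditional."""
    return sum(joint[u][n][g] for u in range(n_U) for g in range(n_Ng))


def p_U_given_NU(u, n):
    """P(U | N_U, s)."""
    return sum(joint[u][n][g] for g in range(n_Ng)) / p_NU(n)


def central_equation(u):
    """Right-hand side of Eq. 1 with integrals replaced by sums."""
    return sum(
        p_U_given_NU(u, n)
        * sum(p_NU_given_Ng(n, g) * p_Ng(g) for g in range(n_Ng))
        for n in range(n_NU)
    )


def p_U(u):
    """Left-hand side of Eq. 1: P(U | s)."""
    return sum(joint[u][n][g] for n in range(n_NU) for g in range(n_Ng))


max_gap = max(abs(central_equation(u) - p_U(u)) for u in range(n_U))
```

Because the identity is just two successive marginalizations, `max_gap` is zero up to floating-point rounding for any joint distribution; the design problem lies in shaping the three factors, not in satisfying the equation.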

In the next subsection we analyze what coordinate utilities give the desired 
form of term 2 in the central equation, for our choice of $N_U$ and $N_g$. We then
present examples illustrating such systems and more generally illustrating gen- 
eralized coordinates. We end this section with a brief discussion of term 1. Then 
in the next section we analyze what coordinate utilities give the desired form of 
term 3 in the central equation. It is only here that the use of agents to control 
some coordinate values becomes crucial. We end that section by combining these 
analyses to derive coordinate utilities that have the desired forms for both term 
2 and term 3. 

This formalism applies to many more scenarios than those that involve dy- 
namical systems with values z specifying behavior across time. It also applies 
even in scenarios that are not conventionally viewed as instances of game theory. 
Nonetheless, as an example of the formalism, App. D is a detailed exposition of 
multistep games in terms of this formalism. 

(iii) Term 2 — Factoredness 

We say that $U_1$ and $U_2$ are (mutually) factored at a point $z$ for coordinate $p$ if
$N_{p,U_1}(z') = N_{p,U_2}(z')$ $\forall z' \in p(z)$.¹⁰ Note that factoredness is transitive. If we
do not specify $U_2$, it is taken to be $G$, and we sometimes say that $U$ “is factored”,
or “is factored with respect to $G$”, when $U$ and $G$ are mutually factored. If, $\forall p$
in a set of coordinates that we are using to analyze a system, the utility $g_p$ is
factored with respect to $G$ for coordinate $p$ at a point $z$, we simply say that the
system is factored at $z$, or that the $\{g_p\}$ are factored with respect to $G$ there.

There is a very tight relation between factoredness and game theory. For
example, consider the case where we have Pareto superiority of a point $z'$ over some
other point $z$ with respect to the coordinate utility intelligences [7, 2, 10, 3, 5].
Say that in addition those associated utilities form a factored system with
respect to the world utility $G$. These together imply the Pareto superiority of $z'$
over $z$ with respect to world utility. The converse also holds. However these
properties relating factoredness, coordinate and world utilities only hold for Pareto
superiority for intelligences (rather than for raw coordinate utility values), in 

9 Since the distributions in Eq. 1 are conditioned on $s$, when we have a percentile-style
intelligence, a natural choice for the associated measure $d\mu(z)$ is given by the values $r = p(z)$
and $s$, as $P(z \mid r) P(r \mid s)$ (see App. A). In other words, given that we are within a particular
$r$, the measure extends across that entire context (including points inconsistent with $s$)
according to the distribution $P(z \mid r)$.

10 In previous work we defined factoredness only to mean that
$\mathrm{sgn}[U_1(z') - U_1(z)] = \mathrm{sgn}[U_2(z') - U_2(z)]$ $\forall z' \in p(z)$.
This is a necessary (but not sufficient) condition that
$N_{p,U_1}(z') = N_{p,U_2}(z')$ $\forall z' \in p(z)$; see Thm. 1 below and the definition of intelligence.
general. In addition, by taking $U_2 = G$, the following theorem provides the
basis for relating game-theoretic concepts like Nash equilibria and non-rational 
behavior with world utility in factored systems: 

Theorem 1 $U_1$ and $U_2$ are mutually factored at $z \in C$ for coordinate $p$ iff

$$\mathrm{sgn}[U_1(z') - U_1(z'')] = \mathrm{sgn}[U_2(z') - U_2(z'')] \quad \forall z', z'' \in p(z).$$

Note that this holds regardless of the precise choice of N, so long as it meets 
the formal definition of an intelligence operator. 
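
Theorem 1 can be illustrated on a finite context by comparing the sign condition with equality of intelligences. The sketch below is a toy check, not a proof: the context, the world utility `G`, and the two transformed utilities are hypothetical, and the intelligence operator is an assumed percentile form with a uniform measure.

```python
from itertools import product


def sgn(x):
    return (x > 0) - (x < 0)


def same_ordering(U1, U2, context):
    """Sign condition of Thm. 1 over all pairs in one context."""
    return all(sgn(U1(a) - U1(b)) == sgn(U2(a) - U2(b))
               for a, b in product(context, repeat=2))


def percentile(U, z, context):
    """Assumed percentile-style intelligence with a uniform measure."""
    return sum(1 for z2 in context if U(z2) <= U(z)) / len(context)


def same_intelligence(U1, U2, context):
    """Definition of mutual factoredness restricted to one context."""
    return all(percentile(U1, z, context) == percentile(U2, z, context)
               for z in context)


context = [0.0, 0.25, 0.5, 0.75, 1.0]  # toy context r (illustrative)


def G(z):
    return z * (1 - z)  # hypothetical world utility


def g_shifted(z):
    return 3 * G(z) - 7  # strictly increasing transform of G


def g_flipped(z):
    return -G(z)  # reverses G's ordering


results = {
    name: (same_ordering(g, G, context), same_intelligence(g, G, context))
    for name, g in [("team", G), ("shifted", g_shifted), ("flipped", g_flipped)]
}
```

In each case the two tests agree: the team-game utility and the increasing transform pass both, and the order-reversing utility fails both.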

By Thm. 1, for a system whose coordinate utilities are factored with respect
to $G$, the set of Nash equilibria of those coordinate utilities equals the set of
points that are maxima of the world utility along each of the coordinates
individually (which of course does not mean that they are maxima along off-axis
directions).¹¹ In addition to this desirable equilibrium structure, factoredness
ensures the appropriate off-equilibrium structure; so long as for each coordinate
the associated intelligence is high (with respect to that coordinate’s utility), the
system will be close to a local maximum of world utility. This is because, for
each coordinate $p$, given a (fixed) associated coordinate value $r$, any change in
$z \in r$ that decreases $p$’s coordinate utility (which is almost all changes if $p$’s
intelligence is high) will assuredly decrease world utility. Note though that
having $g_p$ factored with respect to $G$ does not preclude deleterious side-effects on
the other coordinate utilities of such a $g_p$-improving change within $r$. All such
factoredness tells us is whether world utility gets improved by such changes (see
the end of App. D).¹²

11 An immediate game-theoretic corollary is that any game whose utilities can be expressed
as coordinate utilities of a system that is factored with respect to a world utility having critical
points has at least one pure strategy Nash equilibrium. However consider an arbitrary vector
$\vec{\epsilon}$ all of whose components lie in [0, 1]. Then it is not the case that every factored system has
a pure strategy joint profile with each player’s intelligence given by the associated component
of $\vec{\epsilon}$. This is even true if every component of $\vec{\epsilon}$ is either a 0 or a 1. As a simple example, choose
$g_1 = g_2 = G$, and have $\vec{\epsilon} = (1, 0)$. Have $G = z_1$ for $z_2 > 1/2$, and equal $1 - z_1$ otherwise,
where both $z_1$ and $z_2 \in [0, 1]$. Then if $z_2 > 1/2$, $z_1 = 1$, since $N_1 = 1$. However if $z_1 = 1$, then
$z_2 \in [0, 1/2]$ since $N_2 = 0$. If $z_2 \le 1/2$ though, $z_1 = 0$, which means that $z_2 \in (1/2, 1]$. QED.
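
The two-player example in this footnote can be checked by brute force on a grid. The sketch below rests on assumptions introduced here: the interval [0, 1] is discretized into eleven points, intelligence 1 for the first player is read as "chooses a best response" and intelligence 0 for the second player as "chooses a worst response" (the values $N_1 = 1$, $N_2 = 0$ used in the example), and all names are hypothetical.

```python
from fractions import Fraction

GRID = [Fraction(k, 10) for k in range(11)]  # discretized [0, 1] (assumption)
HALF = Fraction(1, 2)


def G(z1, z2):
    """World utility of the example: z1 if z2 > 1/2, else 1 - z1."""
    return z1 if z2 > HALF else 1 - z1


def best_for_1(z1, z2):
    """Player 1 at intelligence 1: z1 maximizes G given z2."""
    return G(z1, z2) == max(G(a, z2) for a in GRID)


def best_for_2(z1, z2):
    """Player 2 at intelligence 1: z2 maximizes G given z1."""
    return G(z1, z2) == max(G(z1, b) for b in GRID)


def worst_for_2(z1, z2):
    """Player 2 at intelligence 0: z2 minimizes G given z1."""
    return G(z1, z2) == min(G(z1, b) for b in GRID)


# Joint profiles realizing intelligences (1, 0): none should exist.
profiles = [(z1, z2) for z1 in GRID for z2 in GRID
            if best_for_1(z1, z2) and worst_for_2(z1, z2)]

# Pure strategy Nash equilibria of the team game g_1 = g_2 = G.
nash = [(z1, z2) for z1 in GRID for z2 in GRID
        if best_for_1(z1, z2) and best_for_2(z1, z2)]
```

On this grid `profiles` is empty while `nash` is not, matching the footnote: the factored system has pure strategy Nash equilibria, yet no joint profile realizes the mixed intelligence vector.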

12 Factoredness is simply a bit; a system is factored or it isn’t. As such it cannot quantify
situations in which term 2 has a good form although it is not exactly a delta function. Nor
can it characterize “super-factored” situations in which that conditional distribution is better
than a delta function, being biased towards $N_G$ values that exceed the $N_g$ values. One way to
address this deficiency is to define a “degree of factoredness”. One example of such a measure
is $1 - \int dz \, P(z \mid s) [N_G - N_g]^2 \in [0, 1]$. Another is $\int dz \, P(z \mid s) [N_G - N_g]$, which extends
from “partially factored” systems (negative values), to perfectly factored systems (value 0),
to super-factored systems (value greater than 0). Other definitions arise from consideration
of Thm. 1. For example, one might quantify factoredness for coordinate $p$ as the probability
that a random move within a context changes $G$ and $g_p$ the same way:

$$\int dz \, dz' \, P(z \mid s) \, P(z' \mid s) \, \delta(z' \in p(z)) \, \Theta([G(z) - G(z')][g_p(z) - g_p(z')]).$$

Especially when one has a percentile-type intelligence, all these possibilities suggest yet
other variants in which the measure $d\mu(z)$ replaces the distribution $P(z \mid s)$. Similarly,
one can define “local” degree of factoredness about some point $z''$ by introducing into the
integrands of all these variants Heaviside functions restricting the worldpoint to be near $z''$.
The following theorem gives the entire equivalence class of utilities that are
mutually factored at a point:

Theorem 2 $U_1$ and $U_2$ are mutually factored at $z$ for coordinate $p$ iff
$\forall z' \in r = p(z)$, we can write

$$U_1(z') = \Phi_r(U_2(z'))$$

for some $r$-indexed function $\Phi_r$ that is a strictly increasing function of its
argument across the set of all values $U_2(z' \in r)$. (The form of $U_1$ for other arguments
is arbitrary.)

Using some notational overloading of the $\Phi$ function, by Thm. 2 we can
ensure that the system is factored by having each $g_p(z) = \Phi_p(G(z), p(z))$ $\forall z \in \zeta$
for some functions $\Phi_p$ whose partial derivative with respect to the first argument
is strictly positive everywhere. Note that this factoredness holds regardless of $C$
or $P(z \mid s)$. The canonical example of such a case is a team game (also known as
an ‘exact potential game’ [6, 12, 4]) where $g_p = G$ for all $p$. Alternatively, by only
requiring that $\forall z \in C$ does $g_p$ take on such a form, we can access a broader class of
factored utilities, a class that does depend on aspects of $C$.

As an example, define a difference utility for coordinate $p$ with respect to
utility $D_1$ as a utility taking the form $D_p(z) = \beta(z)[D_1(z) - D_2(z)]$ for some
function $D_2$ and positive function $\beta(\cdot)$, where both $\beta(\cdot)$ and $D_2(\cdot)$ have the
same value for any pair of points $z$ and $z' \in C$ for which $p(z) = p(z')$. (We will
sometimes refer to $D_1$ as the lead utility of such a difference utility, with $D_2$
being the secondary utility.) Since both $\beta(z)$ and $D_2(z)$ can be written purely
as a function of $p(z)$, by Thm. 2, a difference utility is factored with respect to
$D_1$. As explicated in the next subsection, for such a utility with $D_1 = G$, term 3
in the central equation can be vastly superior to that of a team game, especially
in large systems. In addition, as a practical matter, often $D_p$ can be evaluated
much more easily than can $D_1$.
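
A difference utility can be built and tested for factoredness on a small game. The sketch below is a hedged illustration: the three-agent game, the crowding-penalty world utility `G`, and the particular choices of the secondary utility `D2` and scale `beta` are all assumptions made here. The only properties taken from the text are that `D2` and `beta` depend on z solely through p(z) and that `beta` is positive.

```python
from itertools import product


def sgn(x):
    return (x > 0) - (x < 0)


AGENTS = 3
MOVES = (0, 1, 2)
C = list(product(MOVES, repeat=AGENTS))  # all joint moves are allowed


def G(z):
    """Hypothetical world utility: penalize crowding on any one move."""
    return -sum(z.count(m) ** 2 for m in MOVES)


def p(z, agent=0):
    """Context coordinate: the moves of every agent except `agent`."""
    return z[:agent] + z[agent + 1:]


def D2(z, agent=0):
    """Secondary utility: depends on z only through p(z)."""
    others = p(z, agent)
    return -sum(others.count(m) ** 2 for m in MOVES)


def beta(z, agent=0):
    """Positive scale factor, again a function of p(z) only."""
    return 1.0 + sum(p(z, agent))


def D(z, agent=0):
    """Difference utility with lead utility D1 = G."""
    return beta(z, agent) * (G(z) - D2(z, agent))


def factored(U1, U2, agent=0):
    """Sign condition of Thm. 1, checked inside every context."""
    return all(sgn(U1(a) - U1(b)) == sgn(U2(a) - U2(b))
               for a in C for b in C if p(a, agent) == p(b, agent))


def selfish(z):
    """A utility that ignores the other agents entirely (for contrast)."""
    return -z[0]


is_factored = factored(D, G)
selfish_is_factored = factored(selfish, G)
```

Within any context `beta` and `D2` are constants, so `D` differs from `G` by a positive rescaling and a shift, which is why the check succeeds; the contrasting `selfish` utility fails it.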


(iv) Term 1 and alternate forms of the central equation 

Assuming term 3 results in a large value of $N_g$, having factoredness then ensures
that we have a large value of $N_G$ as well. In this situation term 1 will determine
how good $G$ is. Intuitively, term 1 reflects how likely the system is to get caught
near local maxima of $G$. If any maximum of $G$ the system finds is likely to be
the global maximum, then term 1 has a good form. (For factored systems, in
such scenarios it is likely that a system near a Nash equilibrium is near the
highest possible $G$.)

So for factored systems, for our choice of $N_G$ and $N_g$, term 1 can be viewed
as a formal encapsulation of the issue underpinning the much-studied
exploration/exploitation trade-off of conventional search algorithms. That trade-off
can manifest itself both within the learning algorithms of the individual agents
as well as in a centralized process determining whether those agents are allowed
to make proposed changes in their state [26]. In this paper we will not consider
such issues, but will instead concentrate on terms 2 and 3. 
As mentioned, term 2 in the central equation is closely related to issues
considered in economics and game theory (cf. Thm. 1 and note the relation
between factoredness and the concept of incentive compatibility in mechanism
design [7, 2, 14, 2, 10, 16, 8, 27, 13, 15]). On the other hand, as expounded below,
term 3 is closely related to signal-noise issues often considered in machine learn- 
ing (but essentially never considered in economics). Finally, as just mentioned, 
term 1 is related to issues considered by the search community. So the central 
equation can be viewed as a way of integrating the fields of economics, machine 
learning, and search. 

Finally, an important alternative to the choice of $N_U$ investigated in this
paper is where it is the scalar $N_U(z)$. In this situation, $N_U$ is a monotonic
transformation of $U$ over all of $C$, rather than just within various partition elements
of $C$. For this choice term 1 in the central equation becomes moot, and that
equation effectively reduces to $P(U \mid s) = \int dN_g \, P(U \mid N_g, s) \, P(N_g \mid s)$. The
analysis presented below of the $P(N_g \mid s)$ term in the central equation is
unchanged by this change. However the analysis of the $P(N_U \mid N_g, s)$ term is now
replaced by analysis of $P(U \mid N_g, s)$. For reasons of space, we do not investigate
this alternative choice of $N_U$ in this paper.

2 The Three Premises 

(i) Coordinate complements, moves, and worldviews 

Since intelligence is bounded above by 1, we can roughly encapsulate the qual- 
ity of term three in the central equation as the associated expected intelligence. 
Accordingly, our analysis of term 3 will be expressed in terms of expected intel- 
ligences. 

We will consider only one coordinate at a time together with the associated
expected coordinate intelligence. This simplifies the analysis to only concern one
of the components of $N_g$ together with the dependence of that component on
associated variations in $s$, our choice of the element of the design coordinate.
For now we further restrict attention to agent coordinate utilities, reserve “$p$” to
refer only to such an agent coordinate with some associated learning algorithm,
and take $g_p = g_{p,s}$.¹³ The context will always make clear whether $p$ specifies
a coordinate (as when it subscripts a private utility), refers to the values the
coordinate can assume (as in $r \in p$), indicates the associated random variable
(as in expressions like $E(U(x, p)) = \int dr \, P(r) \, U(x, r)$), etc.

As a notational matter, define two partitions of some $T \subseteq \zeta$, 
$\pi_1$ and $\pi_2$, to be complements over T if the map 
$z \in T \to (\pi_1(z), \pi_2(z))$ is invertible, so that, 

13 Note that changing $\rho$'s coordinate utility while leaving s unchanged has no effect on the 
probability of a particular G value; $g_\rho$ is just an expansion variable in the central equation. 
Conversely, leaving $\rho$'s coordinate utility the same while making a change to its private utility 
(and therefore to s, and therefore in general to the associated distribution over $\zeta$, $P(z \mid s)$) 
changes the probability distribution across G values. Setting those two utilities equal is what 
allows the expansion of the central equation to be exploited to help determine s. 







intuitively speaking, $\pi_1$ and $\pi_2$ jointly form a “coordinate system” 
for T.$^{14,15}$ When discussing generalized coordinates, this nomenclature is 
used with T implicitly taken to be $\zeta$. ($\pi_1$ and $\pi_2$ are 
coordinate variables in the formal sense if $T = \zeta$.) We adopt the 
convention that for any coordinate $\rho$, ${\sim}\rho$, having labels/values 
written ${\sim}r$, is shorthand for some coordinate that is complementary to 
$\rho$ (the precise such coordinate will not matter) and that 
${\sim}{\sim}\rho \equiv \rho$. We do not take the $\sim$ operator to refer to 
values of a coordinate, only to coordinates as a whole. So for example, there 
is no a priori relationship implied between a particular element of 
${\sim}\rho$ that we write as “${\sim}r$”, and some particular element of 
$\rho$ that we write as “r”. 
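
The definition of complements given above can be made concrete with a small sketch. The toy space (a 3 x 4 grid of points) and the helper name `are_complements` are illustrative assumptions, not part of the formalism; the check simply tests whether the map from points to pairs of partition labels is injective on T.

```python
from itertools import product

def are_complements(T, pi1, pi2):
    """True if z -> (pi1(z), pi2(z)) is invertible (injective) on T."""
    labels = [(pi1(z), pi2(z)) for z in T]
    return len(set(labels)) == len(labels)

# Hypothetical toy space: points are (row, column) pairs on a 3 x 4 grid.
T = list(product(range(3), range(4)))

def row(z):
    return z[0]          # partition by row

def col(z):
    return z[1]          # partition by column

def parity(z):
    return z[1] % 2      # coarser partition: column parity

# Row and column labels together identify each point uniquely.
complements_row_col = are_complements(T, row, col)

# Row and column parity do not: (0, 0) and (0, 2) share both labels.
complements_row_parity = are_complements(T, row, parity)

# The map need not be onto all label pairs (cf. footnote 15):
# on the diagonal only three of the nine (row, col) pairs occur.
T_diag = [(0, 0), (1, 1), (2, 2)]
complements_on_diagonal = are_complements(T_diag, row, col)
```

Here row and column act as a "coordinate system" for the grid, while replacing the column partition by the coarser parity partition loses invertibility.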

We always have 
$E(N_{\rho,U} \mid s) = \int dr \, dn \, dx \, P(r \mid s) P(n \mid r, s) P(x \mid n) N_{\rho,U}(x, r)$. 
Accordingly, if we knew $P(r \mid s)$, and also knew one of $P(n \mid r, s)$ 
and $P(x \mid n)$ but did not know the other, then we could in principle solve 
for that other distribution so as to optimize expected intelligence.$^{16}$ 
Unfortunately, we usually do not know two of those three distributions, and so 
must take a more indirect approach. 
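
For finite spaces the integral above becomes a triple sum, which the following sketch evaluates. Two assumptions are made purely for illustration: intelligence is taken to be the fraction of alternative moves (under a uniform measure) that do no better than the chosen move in the same context, and the utility table and distributions are invented toy values. The function names are hypothetical.

```python
import numpy as np

def intelligence(U):
    """U[x, r] -> N[x, r]: fraction of moves x' with U[x', r] <= U[x, r].

    Assumption: uniform measure over moves; ties count as "not better".
    """
    # not_better[x, x', r] is True when alternative x' does no better than x in context r.
    not_better = U[None, :, :] <= U[:, None, :]
    return not_better.mean(axis=1)

def expected_intelligence(P_r, P_n_given_r, P_x_given_n, N):
    """Sum over r, n, x of P(r|s) P(n|r,s) P(x|n) N(x, r), for one fixed s."""
    # Index layout: P_r[r], P_n_given_r[r, n], P_x_given_n[n, x], N[x, r].
    return float(np.einsum("r,rn,nx,xr->", P_r, P_n_given_r, P_x_given_n, N))

# Toy problem: 3 moves, 2 contexts; the best move depends on the context.
U = np.array([[0.0, 2.0],
              [1.0, 1.0],
              [2.0, 0.0]])
N = intelligence(U)

P_r = np.array([0.5, 0.5])
# The agent plays move 2 on worldview 0 and move 0 on worldview 1.
P_x_given_n = np.array([[0.0, 0.0, 1.0],
                        [1.0, 0.0, 0.0]])

informative = np.eye(2)               # worldview reveals the context exactly
uninformative = np.full((2, 2), 0.5)  # worldview is independent of the context

informed = expected_intelligence(P_r, informative, P_x_given_n, N)
blind = expected_intelligence(P_r, uninformative, P_x_given_n, N)
```

With the same $P(r \mid s)$ and $P(x \mid n)$, changing only $P(n \mid r, s)$ from the informative to the uninformative table lowers the expected intelligence, which is the kind of dependence on s that the analysis of term 3 is concerned with.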

The analysis presented here for agent coordinates revolves around the issue 
of how sensitive $g_\rho$ is to changes within an element of $\rho$ as opposed 
to changes between those elements of $\rho$. To conduct this analysis we will 
need to introduce two coordinates in addition to $\sigma$ and $\rho$: $\xi$ 
and $\nu$.$^{17}$ Given some ${\sim}\rho$, rather than the precise element 
${\sim}r \in {\sim}\rho$, in general the agent associated with $\rho$ can only 
control which of several sets of possible elements ${\sim}r$ the system is in. 
This is formalized with the coordinate $\xi \supseteq {\sim}\rho$. We refer to 
$\xi$ as the move variable of the agent, and we refer to an $x \in \xi$, 
and/or the set of z that that x specifies, as the move value of the agent. For 
convenience we assume that for all such contexts r and moves x there exists at 
least one $z \in \zeta$ such that $\rho(z) = r$ and $\xi(z) = x$. In general, 
what we identify as the $\xi$ of a particular $\rho$ need not be unique. 
Intuitively, such a partition $\xi$ delineates a set of $r \to z$ maps, each 
such map giving a way that the agent associated with $\rho$ is allowed to vary 
its behavior to reflect what context r it is in. An agent's move is a 
selection among such a set of allowed variations. An important example of move 
variables involving dynamic processes is presented in App. D. 

We assume that $\xi(z)$ and $\rho(z)$ jointly set the value of $G(z)$ and of 
any utility we will consider.$^{18}$ Accordingly, we write 
$\underline{g}_\rho$ when we mean the coordinate whose partition elements are 
identical to $\sigma$'s but whose values are instead the private 
14 This characterization as a coordinate system is particularly apt if $\pi_1$ and $\pi_2$ are minimal 
complements, by which is meant that there is neither a coarser partition $\pi' \supset \pi_1$ such that $\pi'$ 
and $\pi_2$ are complements, nor a coarser partition $\pi'' \supset \pi_2$ such that $\pi''$ and $\pi_1$ are complements. 

15 Note that it is not assumed that $T \to (\pi_1, \pi_2)$ taking points z to partition element pairs 
is surjective. 

16 Formally to implement this would require making an associated change to s, a change 
which in the case of solving for $P(x \mid n)$ would have to be reflected in the value of n. 

17 Properly speaking, $\xi$ and $\nu$ should be indexed by $\rho$, as should the coordinates $\sigma_g$ and $\sigma_{{\sim}g}$ 
introduced below; for reasons of clarity, here all such indices are implicit. 

18 Phrased differently, given the utility function, and the associated $\xi$ and $\rho$, the minimal 
choice for $\zeta$ is $\xi \times \rho$. If the value s is not fixed by $x \times r$, i.e., if it is not the case that $\sigma \supseteq \xi \cap \rho$, 
then $\sigma$ must also be contained in $\zeta$, and similarly for $\nu$. 

utility functions of $\rho$: $\underline{g}_\rho : s \in \sigma \to g_{\rho,s}$. 
Similarly, we will write $N_\rho$ when we mean the function 
$(x, r, s) \to N_{\rho, g_{\rho,s}}(x, r)$. 

We refer to $\nu$ as the worldview variable of the agent, and we refer to an 
$n \in \nu$, and/or the set of possible z that that n specifies, as the 
worldview value of the agent. Intuitively, n specifies all the information 
that $\rho$'s agent uses to determine its move, and nothing else: all training 
data, all knowledge of how the training data is formed (including potentially 
knowledge of its own private utility), all observations, all external 
commands, and all externally set prior biases. It is the contents of the 
(perhaps distorting) “window” through which the learning algorithm receives 
information from the external world. 

Formally, there are three properties a coordinate must possess for it to 
qualify as a worldview of an agent. First, if the agent does indeed use all 
the information in n, then the agent's preference in moves must change in 
response to any change in the value of n. This means that 
$\forall\, n_1, n_2 \in \nu$, for at least one of the $x \in \xi$, 
$P(x \mid n_1) \neq P(x \mid n_2)$.$^{19}$ Second, if the worldview truly 
reflects everything the agent uses to make its move, then any change to any 
variable must be able to affect the distribution over moves only insofar as it 
affects n. This means that with $\Omega$ defined as the set of all non-$\xi$ 
coordinates we will consider in our analysis (e.g., $\sigma$, $\rho$ for some 
other agent, their intersection, etc.), $P(x \mid n, W) = P(x \mid n)$ 
$\forall\, x \in \xi$, $n \in \nu$ and $W \in \Omega$ such that 
$P(x, n, W) \neq 0$.$^{20,21,22}$ Finally, of all coordinates obeying these 
two properties, the worldview must be among those whose information maximizes 
the expected performance of the associated Bayes-optimal guessing,$^{23}$ i.e., 

$$\int db \, P(b \mid s) \, E\big(\underline{g}_\rho[\mathrm{argmax}_{x'}\{E(\underline{g}_\rho(x', \rho) \mid b)\}, \rho] \mid b\big)$$

$$\le \int dn \, P(n \mid s) \, E\big(\underline{g}_\rho[\mathrm{argmax}_{x'}\{E(\underline{g}_\rho(x', \rho) \mid n)\}, \rho] \mid n\big).$$

So $P(n \mid s)$ is how the worldview varies with s, and $P(x \mid n)$ is how 
the agent's learning algorithm uses the resultant information. The 
$P(x \mid s)$ induced by these two distributions is how the move of the agent 
varies with s. Alternatively, $P(r \mid s)$ is the distribution over contexts 
caused by our choice of design coordinate value, and the distribution 
$P(x \mid r, s) = \int dn \, P(x \mid n) P(n \mid r, s)$ gives all salient 
aspects of the agent's learning algorithm and technique for inferring 
information about r; the integral over r of the product of these two 
distributions says how choice of s determines the distribution over moves. 

19 When worldviews are numeric-valued, we can modify this requirement to be that the 
distribution $P(x \mid n)$ has to be sufficiently sensitive a function of n over all of $\nu$. 

20 Note that if all W are allowed, then in general the only choice for $\nu$ obeying this restriction 
is $\nu = \xi$. 

21 As a result of this requirement, $P(r \mid x, n, W) = P(r \mid n, W)$, $P(x, r \mid n, W) = P(x \mid n) P(r \mid n, W)$, etc. 

22 For any $P(z)$ and coordinates $\alpha$ and $\beta$, one can always construct a coordinate $\delta \neq \alpha$ such 
that $P(a \mid b, d)$ varies with d. So our assumption about $\nu$ and $\Omega$ constitutes a restriction on 
what coordinates we will consider in our analysis. 

23 If it were not for this requirement, $\xi$ could double as the worldview, and often so could $\sigma$. 

We will find it convenient to decompose 
$\sigma = \sigma_{g_\rho} \times \sigma_{{\sim}g_\rho}$ where 
$\sigma_{g_\rho}$ is a coordinate whose value gives $g_{\rho,s}$, and there is 
no coordinate $\omega \supset \sigma_{g_\rho}$ with this property. 
(Intuitively, $\sigma_{g_\rho}$'s value is a component of s that specifies 
$g_{\rho,s}$ and nothing more.) Also, from now on, we will often drop the 
$\rho$ index whenever its implicit presence is clear. So for example, we will 
often write $s_g$ instead of $s_{g_\rho}$. 

(ii) Ambiguity 

Since we do not know $P(x \mid n)$ in general, we cannot directly say how n 
sets the distribution over x. Fortunately we do not need such detailed 
information. We only need to know the effect that certain changes to n have on 
particular characteristics of the associated distribution $P(x \mid n)$ (e.g., 
the effect certain changes to n have on the “characteristic of 
$P(x \mid n)$” given by an n-conditioned expected intelligence 
$E(N_U \mid n)$). 

Now if there were any universal rule for how such characteristics affect 
expected intelligence, then without any assumptions we could use such a rule 
to deduce that some particular choices of n are superior to others. That has 
been proven to be impossible however [18, 21]. Accordingly, we must make some 
presumption about the nature of the learning algorithm, one that must be as 
conservative as possible if it is to apply to all reasonable algorithms. 

To see what presumption we can safely make concerning such effects, first 
note that the worldview n encapsulates all the information the agent might try 
to exploit concerning the x-dependence of the likely values of the private 
utility. That encapsulation given by n takes the form of the distribution over 
the Euclidean vector of private utility values $(y^1, y^2, \ldots)$ given by 
$\int dr \, ds \, \delta(g_{\rho,s}(x^1, r) - y^1) \, \delta(g_{\rho,s}(x^2, r) - y^2) \cdots P(r, s \mid n)$. 
The agent works by “trying” to use this encapsulation to appropriately set its 
move. Our presumption must concern aspects of how it does this. Furthermore, 
if that presumption is to apply to a wide variety of learning algorithms, it 
must only involve the encapsulated information, and not (for example) any 
characteristics of some class of learning algorithms to which the agent 
belongs. 

For simplicity, consider the case where there are only two possible moves, 
$x^1$ and $x^2$. The encapsulated information provided by n induces a pair of 
distributions of likely utility values at those two x's, 
$\int dr \, ds \, \delta(g_{\rho,s}(x^1, r) - y) P(r, s \mid n)$ and 
$\int dr \, ds \, \delta(g_{\rho,s}(x^2, r) - y) P(r, s \mid n)$, which we can 
write in shorthand as $P(y; \underline{g}_\rho; n, x^1)$ and 
$P(y; \underline{g}_\rho; n, x^2)$, respectively. (Note that unlike n, the 
$x^1$ value in this semicolon notation is a parameter to the random variable 
$\underline{g}_\rho$, not a conditioning event for that random variable.) By 
definition of Von Neumann utility functions, for worldview n, the optimal move 
is $x^1$ if the expected value 
$E(y; \underline{g}_\rho; n, x^1) > E(y; \underline{g}_\rho; n, x^2)$, and 
$x^2$ otherwise. In general though the learning algorithm of the agent will 
not (and often cannot) have its distribution over x set to a delta function 
this way. Other aspects of $P(y; \underline{g}_\rho; n, x^1)$ and 
$P(y; \underline{g}_\rho; n, x^2)$ besides the difference in their first 
moments will affect how $P(x \mid n)$ changes in going from one n to another. 
For example, it may be that if 
$E(y; \underline{g}_\rho; n, x^1) > E(y; \underline{g}_\rho; n, x^2)$, then if 
n is changed so that both the probability of a relatively large y value at 
$x^2$ and the probability of a relatively small y value at $x^1$ shrink, while 
the first moments of those distributions are unchanged, then the algorithm is 
more likely to choose $x^1$ with the new n than with the original one. 
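
The example just given can be illustrated with a deliberately simple, hypothetical learner: one that draws a single utility sample at each move and picks the move with the larger sample. Neither this learner nor the Gaussian form of the two utility distributions is assumed anywhere in the analysis; they are chosen only because the resulting choice probability has a closed form.

```python
import math

def prob_choose_x1(mu1, sd1, mu2, sd2):
    """P(y1 > y2) for independent Gaussians y1 ~ N(mu1, sd1^2), y2 ~ N(mu2, sd2^2).

    Hypothetical learner: samples one utility value per move, picks the larger.
    """
    z = (mu1 - mu2) / math.sqrt(sd1 ** 2 + sd2 ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Same first moments (1.0 at x1, 0.0 at x2) under both worldviews;
# only the spread of the utility distributions differs.
wide = prob_choose_x1(1.0, 2.0, 0.0, 2.0)
narrow = prob_choose_x1(1.0, 0.5, 0.0, 0.5)
```

Shrinking the tails while holding the means fixed raises the probability that this learner selects the better move, even though a comparison of first moments alone would not distinguish the two worldviews.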

In light of this, we want to err on the side of caution in presuming how 
changes to $P(y; \underline{g}_\rho; n, x^1)$ and 
$P(y; \underline{g}_\rho; n, x^2)$ induced by changing n affect the associated 
distribution $P(x \mid n)$. The most unrestrictive such presumption we can 
make is that if the entire distributions $P(y; \underline{g}_\rho; n, x^1)$ 
and $P(y; \underline{g}_\rho; n, x^2)$ are “further separated” from one 
another after the change in n, then $P(x \mid n)$ gets weighted more to the 
higher of those two distributions. Such a presumption is the most conservative 
one we can make that holds for any learning algorithm, i.e., that is cast 
purely in terms of the set of posterior distributions 
$\{P(y; \underline{g}_\rho; n, x)\}$ without any reference to attributes of 
the learning algorithm. This can be viewed as a first-principles justification 
that it applies to any learning algorithm not horribly mis-suited to the 
learning problem at hand.$^{24}$ 

To formalize the foregoing, consider the quantity 

$$P(g^1 = y^1, g^2 = y^2; n, x^1, x^2) \equiv P(\underline{g}_\rho(x^1, \rho) = y^1 \mid n) \, P(\underline{g}_\rho(x^2, \rho) = y^2 \mid n),$$

which expands into the distribution 

$$\int dr^1 dr^2 ds^1 ds^2 \, \delta(g_{s^1}(x^1, r^1) - y^1) \, \delta(g_{s^2}(x^2, r^2) - y^2) \, P(r^1, s^1 \mid n) P(r^2, s^2 \mid n).$$

This is the distribution generated by sampling $P(r', s' \mid n)$ to get 
values of $g_{s'}(\cdot, r')$ at $x^1$, and then doing this again (in an IID 
manner) to get values at $x^2$. This “semicolon” distribution is the most 
accurate possible distribution of private utility values at $x^1$ and $x^2$ 
that the agent could possibly employ to decide which x to adopt to optimize 
that private utility, based solely on n. 

Now also fix a utility U that is a single-valued function of x. Our “most 
accurate distribution” induces the convolution distribution 
$P(y = y^1 - y^2 \mid n, x^1, x^2)$. The more weighted this convolution is 
towards values of y that are large and that have the same sign as 
$U(x^1) - U(x^2)$, the less likely we expect the agent to be “led astray, as 
far as $U(\cdot)$ is concerned” in “deciding between $x^1$ and $x^2$”, when 
the worldview is n. On the other hand, if the convolution distribution is 
heavily weighted around the value 0, then we expect the agent is more likely 
to be mistaken (again, as far as U is concerned) in its choice of x. 

So consider changing $n^a$ to $n^b$ in such a way that the associated 
convolution distribution, 
$P([g^1 - g^2] \, \mathrm{sgn}[U(x^1) - U(x^2)]; n^a, x^1, x^2)$, is more 
weighted upwards than is 
$P([g^1 - g^2] \, \mathrm{sgn}[U(x^1) - U(x^2)]; n^b, x^1, x^2)$. Say this is 
the case for all pairs of x values $(x^1, x^2)$, i.e., with worldview $n^a$, 
the agent is less likely to be led astray for all decisions between a pair of 
x values than it is with worldview $n^b$. 

24 If the learning algorithm and underlying distribution over utility values do not adhere to 
this presumption, then in essence that underlying distribution is “adversarially chosen” for the 
learning algorithm: that algorithm's implicit assumptions concerning the learning problem 
are such a poor match to the actual ones that the algorithm is likely to perform badly for 
that underlying distribution no matter what one does to s, n, or the like. 

Our assumption is that whenever such a situation arises, if we truly have an 
adaptive agent operating in a learnable environment, then the agent has higher 
intelligence with respect to U, on average, with worldview $n^a$. 

Now in general we can encapsulate how much a stochastic process over $\zeta$ 
weights some random variable V upward, given some coordinate value 
$l \in \lambda$, with $\mathrm{CDF}_V(y \mid l)$: the smaller this cumulative 
distribution function, the larger the l-conditioned values of V tend to 
be.$^{25}$ Accordingly, we can use such a CDF to quantify how much more 
“weighted upward” our convolution distribution for $n^a$ is in comparison to 
the one for $n^b$. (See App. A for how this CDF is related to intelligence.) 

To formalize this we extend the semicolon notation introduced above. Given a 
coordinate $\chi$ whose value c is a single-valued function of $(x, r, s)$, 
and arbitrary coordinate $\lambda$, define the $(x^1, x^2, l)$-parameterized 
distribution over values $c^1, c^2$, 

$$P(\chi^1, \chi^2; l, x^1, x^2) \equiv P_\chi(c^1, c^2; l, x^1, x^2) \equiv \int dr^1 dr^2 ds^1 ds^2 \, P(r^1, s^1 \mid l) P(r^2, s^2 \mid l) \, \delta(\chi(x^1, r^1, s^1) - c^1) \, \delta(\chi(x^2, r^2, s^2) - c^2).$$

So in this expression $\chi$ is a random variable that is (being treated as) 
parameterized by x, and we are considering its l-conditioned distributions at 
$x^1$ and $x^2$. This notation is sometimes simplified when the meaning is 
clear, e.g., $P_\chi(c^1, c^2; l, x^1, x^2)$ is written as 
$P(c^1, c^2; l, x^1, x^2)$. 

Expectations, variances, marginalizations, and CDF's of this distribution and 
of functionals of it are written with the obvious notation. In particular, 
$P_\chi(c; l, x) \equiv P(\chi(x, \rho, \sigma) = c \mid l)$, so 
$P_\chi(c^1, c^2; l, x^1, x^2) = P_\chi(c^1; l, x^1) P_\chi(c^2; l, x^2)$. As 
another example, say that $\chi$ is the real-valued coordinate $\psi$ taking 
values $y^i$ at $(x^i, r^i, s^i)$. Then for any function 
$f : \mathbb{R}^2 \to \mathbb{R}$, for any l, 

$$\mathrm{CDF}_{f(y^1, y^2)}(y; l, x^1, x^2) \equiv \int dy^1 dy^2 \, P(y^1, y^2; l, x^1, x^2) \, \Theta[y - f(y^1, y^2)]$$

$$= \int dr^1 dr^2 ds^1 ds^2 \, P(r^1, s^1 \mid l) P(r^2, s^2 \mid l) \, \Theta[y - f(\psi(x^1, r^1, s^1), \psi(x^2, r^2, s^2))].$$

Using this notation, for any single-valued function $U : x \to \mathbb{R}$, we 
define the (ordered) ambiguity of U and $\psi$, for l, $x^1$, $x^2$, as the 
CDF of the associated convolution distribution: 

$$A(y; U, \psi; l, x^1, x^2) \equiv \mathrm{CDF}_{(\psi^1 - \psi^2)\,\mathrm{sgn}[U(x^1) - U(x^2)]}(y; l, x^1, x^2).$$

Note that the argument of the sgn is just a constant as far as the 
integrations giving the CDF are concerned. That sgn term provides an ordering 
of the x's; 

25 Let u be a real-valued random variable, and $F : \mathbb{R} \to \mathbb{R}$ a function such that $F(y) > y$ 
$\forall\, y \in \mathbb{R}$. Then $P(F(u) < y) < P(u < y)$ $\forall\, y$, i.e., the monotonically increasing function 
F applied to the underlying random variable pushes the CDF down. Conversely, if $\mathrm{CDF}_1 < \mathrm{CDF}_2$, 
then the function $F(u) \equiv \mathrm{CDF}_1^{-1}(\mathrm{CDF}_2(u))$ is a monotonically increasing function 
that transforms a variable distributed according to $\mathrm{CDF}_2$ into one distributed according to $\mathrm{CDF}_1$. 

ordered ambiguity says how separated our two y-distributions are “in the 
direction” given by that ordering. When U is not specified, the random 
variable in the CDF is understood to be $(\psi^1 - \psi^2)$ rather than 
$(\psi^1 - \psi^2) \, \mathrm{sgn}[U(x^1) - U(x^2)]$. It is easy to verify 
that such unordered ambiguities are related to ordered ones by 

$$A(y; U, \psi; l, x^1, x^2) = 1/2 + t_U(x^1, x^2) \, [A(t_U(x^1, x^2) \, y; \psi; l, x^1, x^2) - 1/2]$$

where $t_U(x^1, x^2) \equiv \mathrm{sgn}[U(x^1) - U(x^2)]$. 
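
For utility distributions with finite support, ambiguities can be computed by direct enumeration. The sketch below is an illustration under assumed toy values (the supports, probabilities, and function names are all hypothetical). It evaluates ordered and unordered ambiguities and checks the relation between them at a point y chosen so that no probability mass sits exactly at $-y$; with discrete distributions the relation can fail at such atoms, a caveat that does not arise for continuous densities.

```python
import numpy as np

def ordered_ambiguity(y, t_u, vals1, p1, vals2, p2):
    """CDF at y of (psi1 - psi2) * t_u, with psi1 and psi2 drawn independently.

    vals1, p1: support and probabilities of the utility at x1 (given l);
    vals2, p2: the same at x2; t_u stands for sgn[U(x1) - U(x2)].
    """
    diff = t_u * np.subtract.outer(vals1, vals2)  # all pairwise differences
    weight = np.multiply.outer(p1, p2)            # independent sampling
    return float(weight[diff <= y].sum())

def unordered_ambiguity(y, vals1, p1, vals2, p2):
    """CDF at y of psi1 - psi2 (no ordering imposed by U)."""
    return ordered_ambiguity(y, 1.0, vals1, p1, vals2, p2)

# Toy distributions of the utility at the two moves.
vals1 = np.array([0.0, 1.0, 3.0])
p1 = np.array([0.2, 0.5, 0.3])
vals2 = np.array([0.5, 2.0])
p2 = np.array([0.6, 0.4])

y = 0.25     # neither y nor -y coincides with an atom of psi1 - psi2
t_u = -1.0   # case U(x1) < U(x2)

lhs = ordered_ambiguity(y, t_u, vals1, p1, vals2, p2)
rhs = 0.5 + t_u * (unordered_ambiguity(t_u * y, vals1, p1, vals2, p2) - 0.5)

# Interchanging x1 and x2 flips both the difference and the sign of t_u,
# so the ordered ambiguity is unchanged.
swapped = ordered_ambiguity(y, -t_u, vals2, p2, vals1, p1)
```

The last line reflects the symmetry of the ordered ambiguity under interchange of the two moves.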

We write just $A(U, \psi; l, x^1, x^2)$ (or $A(\psi; l, x^1, x^2)$) when we 
want to refer to the entire function over all y. If that entire function 
shrinks as we go from one n to another (if its value decreases for every value 
of the argument y), then intuitively, the function has been “pushed” towards 
more positive values of y. Taking $\lambda = \nu$, such a change will serve as 
our formalization of the concept that the distributions over U at $x^1$ and 
$x^2$ are “more separated” after that change in the value of $\nu$. 
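
Whether one ambiguity lies entirely below another can be checked numerically on a grid. The sketch below assumes, for illustration only, independent Gaussian utility distributions at the two moves, so that the ambiguity has a closed form; the grid check is a finite proxy for "for every value of y", and all names are hypothetical.

```python
import math

def gaussian_ambiguity(y, mu1, sd1, mu2, sd2):
    """Unordered ambiguity at y for independent Gaussian utilities at x1 and x2."""
    z = (y - (mu1 - mu2)) / math.sqrt(sd1 ** 2 + sd2 ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lies_below(amb_a, amb_b, grid):
    """True if amb_a(y) < amb_b(y) at every grid point (finite proxy for all y)."""
    return all(amb_a(y) < amb_b(y) for y in grid)

grid = [-5.0 + 0.1 * i for i in range(101)]

def amb_a(y):
    return gaussian_ambiguity(y, 2.0, 1.0, 0.0, 1.0)   # means further apart

def amb_b(y):
    return gaussian_ambiguity(y, 1.0, 1.0, 0.0, 1.0)   # reference worldview

def amb_c(y):
    return gaussian_ambiguity(y, 1.0, 0.5, 0.0, 0.5)   # same means, less spread

a_below_b = lies_below(amb_a, amb_b, grid)
c_below_b = lies_below(amb_c, amb_b, grid)
b_below_c = lies_below(amb_b, amb_c, grid)
```

Shifting the means apart pushes the whole ambiguity down, whereas only narrowing the spread makes the two curves cross, so neither lies entirely below the other in that case.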

Expanding it in full we can write $A(y; U, \psi; l, x^1, x^2)$ as 

$$\int dr^1 dr^2 ds^1 ds^2 \, P(r^1, s^1 \mid l) P(r^2, s^2 \mid l) \, \Theta\big[y - (\psi(x^1, r^1, s^1) - \psi(x^2, r^2, s^2)) \, \mathrm{sgn}[U(x^1) - U(x^2)]\big],$$

or, by changing coordinates, as 

$$\int dy^1 dy^2 \, P_\psi(y^1; l, x^1) P_\psi(y^2; l, x^2) \, \Theta\big[y - (y^1 - y^2) \, \mathrm{sgn}(U(x^1) - U(x^2))\big],$$

and similarly for unordered ambiguities. So ambiguity is parameterized by the 
two distributions $P(\psi; l, x^i)$ as well as (for ordered ambiguities) 
U.$^{26}$ As a final comment, it is worth noting that there is an alternative 
to A, $A^*$, that also reflects the entire n-conditioned CDF of differences in 
utility values. It and our choice of A rather than $A^*$ are discussed in 
App. G. 

(iii) The first premise 

By considering ambiguity with $\psi = \underline{g}_\rho$ and 
$\lambda = \nu$, we can formalize the conclusion of our reasoning about how 
certain changes in n affect the probability of the agent's “choosing” a 
particular x. We call this the first premise: 

$$A(U, \underline{g}_\rho; n^a, x^1, x^2) < A(U, \underline{g}_\rho; n^b, x^1, x^2) \;\; \forall\, x^1, x^2 \;\;\Longrightarrow\;\; \mathrm{CDF}(U \mid n^a) < \mathrm{CDF}(U \mid n^b),$$

26 Note that the ordered ambiguity does not change if we interchange $x^1$ and $x^2$, unlike 
the unordered ambiguity. Note also that unless $\mathrm{sgn}[\psi(x^1, r^1, s^1) - \psi(x^2, r^2, s^2)]$ is the same 
$\forall\, (r^1, s^1), (r^2, s^2) \in \mathrm{supp}\, P(\cdot, \cdot \mid n)$, the associated ordered ambiguity is non-zero for some 
$y < 0$. More generally, to have the ambiguity be strongly weighted towards positive values 
of y, we need that sgn to be the same for all $(r', s')$ in a set with measure (according to 
$P(r', s' \mid n)$) close to 1. 

where U, $n^a$, and $n^b$ are arbitrary (up to the usual restrictions, that 
$z \in \zeta$, that U is a function of x, etc.).$^{27}$ In other words, we 
presume that when the condition in the first premise holds, the distribution 
$P(x \mid n^a)$ must be so much better “aligned” with $U(x)$ than 
$P(x \mid n^b)$ is that the implication in the first premise (concerning the 
two associated CDF's) holds. Note that that implication does not involve a 
specification of r; since in general the agent knows nothing about r, the 
first premise, which purely concerns $P(x \mid n)$, cannot concern r. 

Summarizing, U determines which of the two possible moves $x^1$ and $x^2$ by 
agent $\rho$ is better; $g_{\rho,s}$ is the (s-parameterized) private utility 
that agent $\rho$ is trying to maximize, based exclusively on the value of the 
worldview, n (a worldview that may or may not provide the agent with the 
functional form of that private utility). 

The first premise is, at root, the following assumption: If every one of the 
ambiguities $A(\underline{g}_\rho; n^a, x^1, x^2)$ (one for each 
$(x^1, x^2)$ pair) is superior (as far as U is concerned) to the corresponding 
$A(\underline{g}_\rho; n^b, x^1, x^2)$, then if we replace $n^b$ with $n^a$, 
the effect on $P(x \mid n)$ due to that superiority dominates any other 
characteristics of the two n's. In addition, that dominating effect pushes 
$P(x \mid n)$ to favor x's having high values of U. As argued above, this is 
the most broadly applicable rule relating certain changes to n and associated 
changes to an agent's choice of x. There is no alternative we could formulate 
that is more conservative, i.e., that applies to more learning algorithms, 
while only involving the distributions of the problem at hand confronting the 
algorithm. 

To explicitly relate the first premise to intelligence, we start with the 
following result, which has nothing to do with learning algorithms, and which 
in particular holds regardless of the validity of the first premise. (Indeed, 
it can be seen as motivating the use of a CDF like ambiguity to analyze 
properties of intelligence.) 

Theorem 3 Given any coordinates $\omega$, $\kappa$ and $\lambda$, fixed 
$k \in \kappa$, and two functions $V^a : (w, k) \to \mathbb{R}$ and 
$V^b : (w, k) \to \mathbb{R}$ that are mutually factored for coordinate 
$\kappa$, 

$$\mathrm{CDF}(V^a \mid l^a, k) < \mathrm{CDF}(V^b \mid l^b, k) \;\;\Longrightarrow\;\; E(N_{\kappa,V^a} \mid l^a, k) > E(N_{\kappa,V^b} \mid l^b, k)$$

and similarly when the inequalities are both replaced by equalities. 

Now take $\omega = \xi$ and for a fixed k, define 
$U(\cdot) \equiv V(\cdot, k)$ (so that U is a function of x). Then since 
$P(x \mid n, k) = P(x \mid n)$ (by definition of worldviews), assuming both 
$P(n^a, k)$ and $P(n^b, k)$ are nonzero, 
$\mathrm{CDF}(U \mid n^a) < \mathrm{CDF}(U \mid n^b) \Rightarrow \mathrm{CDF}(U \mid n^a, k) < \mathrm{CDF}(U \mid n^b, k) \Rightarrow \mathrm{CDF}(V \mid n^a, k) < \mathrm{CDF}(V \mid n^b, k)$. 
So if we choose $\lambda = \nu$ in Thm. 3 and combine it with the first 
premise, we get 

27 Note that the functional (sic) inequality in the first premise is equivalent to 
$t_U(x^1, x^2) A(\underline{g}_\rho; n^a, x^1, x^2) < t_U(x^1, x^2) A(\underline{g}_\rho; n^b, x^1, x^2)$. In turn, this inequality implies 
that $U(x^1) \neq U(x^2)$, since otherwise $t_U(x^1, x^2) = 0$. 

the promised relation between ambiguities based on the x-ordering $V(x, k)$ 
and expected $\kappa$-intelligences of V conditioned on k and n. In turn, to 
relate the first premise to the problem of choosing s, use the fact that 
$E(N_{\kappa,V} \mid n, k, s) = E(N_{\kappa,V}(\xi, k) \mid n, k, s) = E(N_{\kappa,V} \mid n, k)$ 
to derive the equality 
$E(N_{\kappa,V} \mid s) = \int dn \, dk \, P(n, k \mid s) E(N_{\kappa,V} \mid n, k)$. 

(iv) Recasting the first premise 

Below we will need to use a more general formulation of the first premise than 
that given above. To derive this more general form, start by defining a 
parameterized distribution H whose parameter has redundant variables: 

$$P(x \mid n) \equiv H_{\{A(\underline{g}_\rho; n, x^1, x^2) \,:\, x^1, x^2 \in \xi\},\, n}(x).$$

Note that unordered ambiguity is used in this definition, and that H 
implicitly carries an index identifying the agent as $\rho$. 

In general, the complexity of $P(x \mid n)$ can be daunting, especially if 
$\nu$ is fine-grained enough to capture many different kinds of data that one 
might have the learning algorithm exploit. This complexity can make it 
essentially impossible to work with $P(x \mid n)$ directly. However in many 
situations it is reasonable to suppose that the dependence of H on its $\nu$ 
argument is small in comparison to associated changes in the ambiguity 
arguments (e.g., n's value does not set a priori biases of the learning 
algorithm across $\xi$, etc.). In such situations all aspects of 
$P(x \mid n)$ get reduced to the dependence of H on ambiguities. In other 
words, in such situations the functional dependence of $P(x \mid n)$ on the 
set of ambiguities can be seen as a low-dimensional parameterization of the 
set of all reasonable learning algorithms $P(x \mid n)$. Accordingly, in these 
situations one can work with the ambiguities, and thereby circumvent the 
difficulties with working with $P(x \mid n)$ directly. 

Another advantage of reducing $P(x \mid n)$ to H is that often extremely 
general information concerning $P(\underline{g}_\rho \mid n)$ allows us to 
identify ways to improve ambiguities, and therefore (by the first premise) 
improve intelligence. Reduction to H, with its explicit dependence on those 
ambiguities, facilitates the associated analysis. 

In particular, say that the worldview coordinate value specifies the private 
utility (or at least that we can assume that augmenting the worldview to 
contain that information would not appreciably change $P(x \mid n)$). This 
means that $P(\underline{g}_\rho \mid n)$, which arises in calculating 
ambiguities, can be replaced by $P(g_{\rho,s} \mid n)$, where $g_{\rho,s}$ is 
the private utility specified by n. Say that in addition, $P(x \mid n)$ not 
only is dominated by the set of associated ambiguities (one ambiguity for each 
x pair), but can be written as a function exclusively of those ambiguities, a 
function whose domain is the set of all possible ambiguities. Under these two 
conditions we could consider the effects on $P(x \mid n)$ of replacing the 
actual ambiguities 
$\{A(\underline{g}_\rho; n, x^i, x^j) : x^i, x^j \in \xi\} = \{A(g_{\rho,s}; n, x^i, x^j) : x^i, x^j \in \xi\}$ 
with counterfactual ambiguities 
$\{A(g_{\rho,s'}; n, x^i, x^j) : x^i, x^j \in \xi\}$ that are based on the 
actual n at hand but are evaluated for some alternative candidate private 
utility $g_{\rho,s'}$. Under certain circumstances, this approach could be 
used to determine what such candidate private utility to use, based on 
comparing the associated counterfactual ambiguities. 

To use this approach in as broad a set of circumstances as possible, we must 
address the fact that $P(x \mid n)$ may have some dependence on n not fully 
captured in the associated ambiguities, e.g., when n modifies the learning 
algorithm, for example by specifying biases for the learning algorithm to use. 
This means the definition given above for H will not in general extend to 
parameter values whose ambiguity set does not correspond to n. Another hurdle 
is that often the domain of $P(x \mid n)$ need not extend to all ambiguities 
of the form $\{A(g_{\rho,s'}; n, x^i, x^j) : x^i, x^j \in \xi\}$. Finally, in 
general worldviews do not specify the private utility. 

To circumvent these difficulties we need to introduce new notation and recast 
the first premise accordingly. Start by extending the domain of definition of 
H to write it as $H_{\{A(\psi; l, x^1, x^2) : x^1, x^2 \in \xi\},\, n}(x)$, 
for any coordinate value $l \in \lambda \subseteq \nu$. Here $\psi$ is an 
arbitrary real-valued function of x, r, and s, not necessarily related to 
$\underline{g}_\rho$. So 
$H_{\{A(\psi; l, x^1, x^2) : x^1, x^2 \in \xi\},\, n}(x)$ is not necessarily 
related to the actual $P(x \mid n)$. Despite these freedoms, we require that 
for any value of its parameters 
$H_{\{A(\psi; l, x^1, x^2) : x^1, x^2 \in \xi\},\, n}(x)$ is a proper 
probability distribution over x, one that for fixed $\psi$ and 
$\lambda = \nu$ is (like $P(x \mid n)$) parameterized by n. This extending of 
H's domain is how we circumvent the first two of our difficulties. 

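Next we introduce some succinct notation. As in the definition of worldviews 
let $W \in \Omega$ refer to the set of all non-$\xi$ coordinates we will 
consider in our analysis, and define the distribution 
$P^{[\psi;\lambda]}(x, l, W) \equiv H_{\{A(\psi; l, x^1, x^2) : x^1, x^2 \in \xi\},\, n}(x) \, P(l, W)$, 
where $\lambda \subseteq \nu$. When $\psi = \underline{g}_\rho$, we just write 
$P^{[\lambda]}$. So for example 
$P^{[\nu]}(x \mid n) = P^{[\underline{g}_\rho;\nu]}(x \mid n) = P(x \mid n)$, 
$P^{[\psi;\lambda]}(x \mid l, W) = P^{[\psi;\lambda]}(x \mid l) = H_{\{A(\psi; l, x^1, x^2) : x^1, x^2 \in \xi\},\, n}(x)$, 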
etc. Note also that p(2p ;x/ ^l(nr j n } s) = (x | n, s). Intuitively, we 

view the learning algorithm as taking arbitrary sets of ambiguities and 
worldviews as input and producing a distribution over x; 
$P^{[\psi;\lambda]}(x \mid l)$ is the distribution over x that arises when the 
learning algorithm is fed the ambiguities 
$\{A(\psi; l, x^1, x^2) : x^1, x^2 \in \xi\}$ and worldview n specified by l. 

Now consider the following elementary result: 

Lemma 1 Consider any two probability density functions over the reals, $P_1$ 
and $P_2$, where $P_1(u)/P_2(u) > P_1(u')/P_2(u')$ 
$\forall\, u, u' \in \mathbb{R}$ where $u > u'$. Say we also have any 
$\phi : \mathbb{R} \to \mathbb{R}$ with nowhere negative derivative. Then 
$\mathrm{CDF}_{P_1}(\phi) < \mathrm{CDF}_{P_2}(\phi)$. 

Combining this lemma with the first premise, and using our new notation, we 
arrive at the following version of the first premise, derived in the appendix: 

Theorem 4 Given coordinate values $l^a$ and $l^b \in \lambda \subseteq \nu$, 
$\exists\, H$ such that 

$$A(U, \psi^a; l^a, x^1, x^2) < A(U, \psi^b; l^b, x^1, x^2) \;\; \forall\, x^1, x^2 \;\;\Longrightarrow\;\; \mathrm{CDF}^{[\psi^a;\lambda]}(U \mid l^a) < \mathrm{CDF}^{[\psi^b;\lambda]}(U \mid l^b),$$

where as usual $\psi^a$, $\psi^b$ and (the r-independent) U are arbitrary. 

Figure 1: The solid line depicts an ambiguity $A(y; V; l, x^1, x^2)$. The 
dotted line depicts $A(y; KV; l, x^1, x^2) = A(y/K; V; l, x^1, x^2)$ for 
$K > 1$; the dashed line is $A(KV; l, x^1, x^2)$ for $0 < K < 1$. Neither of 
those scaled-utility ambiguities lies entirely below the original one. 
Accordingly, neither of those scaled utilities is recommended by the first 
premise. 

This theorem is illustrated geometrically in Fig. 1. 
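
The scaling behavior described in the caption of Fig. 1 can be reproduced numerically. The sketch below uses an assumed discrete distribution for the utility difference $V^1 - V^2$ (the support, weights, and value of K are toy choices) and checks both the identity $A(y; KV) = A(y/K; V)$ and the fact that the scaled ambiguity is lower on one side of $y = 0$ and higher on the other.

```python
import numpy as np

def ambiguity(y, d, w):
    """CDF at y of a discrete difference variable with support d and weights w."""
    return float(w[d <= y].sum())

# Assumed toy distribution of the utility difference V1 - V2.
d = np.array([-1.0, 0.5, 2.0])
w = np.array([0.2, 0.3, 0.5])
K = 3.0  # scaling the utility by K scales every difference by K

# For y > 0 the scaled ambiguity is lower than the original.
scaled_pos = ambiguity(1.0, K * d, w)
original_pos = ambiguity(1.0, d, w)

# For y < 0 the scaled ambiguity is higher than the original.
scaled_neg = ambiguity(-2.0, K * d, w)
original_neg = ambiguity(-2.0, d, w)

# Scaling identity: A(y; KV) = A(y / K; V).
identity_holds = (scaled_pos == ambiguity(1.0 / K, d, w)
                  and scaled_neg == ambiguity(-2.0 / K, d, w))
```

Since the two curves cross, the condition of Theorem 4 is not met in either direction, in line with the caption's remark that rescaling the utility is not recommended by the first premise.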

Because it holds for any underlying distribution over $\zeta$, Thm. 3 holds 
for CDF's and expectation values based on any $P^{[\psi;\lambda]}$, not just 
$P^{[\underline{g}_\rho;\nu]}$. Since for any $\psi$, 
$P^{[\psi;\lambda]}(x \mid l, W) = P^{[\psi;\lambda]}(x \mid l)$, the 
discussion following Thm. 3 holds for $P^{[\psi;\lambda]}$ conditioned on l 
just as well as for P conditioned on n. So Thm. 4 has the following corollary: 

Corollary 1 Given any coordinates $\kappa$ and $\lambda \subseteq \nu$, fixed 
$k \in \kappa$, and $V : (x, k) \to \mathbb{R}$, $\exists\, H$ such that 

$$A(V(\cdot, k), \psi^a; l^a, x^1, x^2) < A(V(\cdot, k), \psi^b; l^b, x^1, x^2) \;\; \forall\, x^1, x^2 \;\;\Longrightarrow\;\; E^{[\psi^a;\lambda]}(N_{\kappa,V} \mid l^a, k) > E^{[\psi^b;\lambda]}(N_{\kappa,V} \mid l^b, k).$$

Summarizing, for a particular value of k, V determines which of the two 
possible moves x l and x 2 by agent p are better; g_p S is the (s-parameterized) 
private utility that agent p is trying to maximize, based exclusively on the value 
of the worldview, n (a worldview that may or may not provide the agent with the 
functional form of that private utility); ip a and ip h are two real- valued functions 
of x, r and s that are used to evaluate ambiguities, and l a and l h are values of a 
conditioning variable for evaluating- ambiguities; a variable that specifies n at a 
minimum. In addition, H is a parametrized distribution over x that is- defined for 
any parameter value that consists of 0(f) CDF’s and a worldview, a distribution 
that equals P(x\n) when the its parameter value is the set n)} together 

with n, and more generally for any A C v is expressed as P^ ;A ^(x | l) whenever 
the CDF’s are the ambiguities { A(ip ; Z, x 1 , x 2 ) : x 1 , x 2 6 £)}. From now on, unless 
explicitly stated otherwise, we will assume that we are restricting attention to
an H for which Coroll. 1 holds.
(v) The second premise

Having rewritten the first premise this way, we can address the potential problem 
arising when the worldview does not specify the private utility. First consider 
any changes to s that modify the associated set of n for which $P(n \mid s)$ is
substantial. Typically, any such change in the likely n fixes fairly precisely what 
the inducing changes in s are, as far as evaluation of ambiguities is concerned. 
Accordingly, when exploiting the first premise we usually restrict attention to 
scenarios in which $\forall r \in \mathrm{supp}\, P(r \mid s)$ we can approximate

$$\int dn\, P(n \mid s)\, P(x \mid n) = \int dn\, P(n \mid s)\, P^{[\gamma_\rho;\nu]}(x \mid n, s).$$

We refer to this approximation as the second premise. Note that it holds
exactly if $n$ contains a specification of $g_{\rho,s}$ and $P(x \mid n)$ only depends on the
associated ambiguities, $\{A(\gamma_\rho; n, x^i, x^j)\} = \{A(g_{\rho,s}; n, x^i, x^j)\}$. So if we can treat
the system as though this were the case, on average, then the second premise
holds. 28 A semi-formal example of a more general situation where the second
premise holds is presented in App. F. 29

The following corollary of the second premise is often useful: 

Corollary 2 Where $V$ is any utility function, $h \in \eta$ any coordinate, and $W \in \Omega$
any non-$\xi$ coordinate,

$$E(V \mid h, s) = \int dn\, dW\, P(W \mid s)\, P(n \mid W, s)\, E^{[\gamma_\rho;\nu]}(V \mid n, s, h, W).$$

Often this result can be used in conjunction with Coroll. 1 to analyze the
implications of various choices of $s$. As an example, in many situations (e.g., in very
large systems) changes to $\rho$'s private utility will have relatively little effect on the
rest of the system, i.e., will have minimal effect on the distribution over $r$ values.
Accordingly consider $s^a$ and $s^b$ that vary only in that choice of $\rho$'s private
utility 30 , in a situation where this implies that $P(r \mid s^a) = P(r \mid s^b) \equiv P(r \mid s^{ab})$.

28 Conversely, if $\sigma$ is "perniciously chosen" to always force $n$ to equal $n'$ for any $s$, where
$n'$ gives no information about the likely values of $g_{\rho,s}$ that $s$ is inducing at the various $r$,
then $\int dn\, P(n \mid s) P(x \mid n) = P(x \mid n')$ and does not reflect the ambiguities determining
$\int dn\, P(n \mid s) P^{[\gamma_\rho;\nu]}(x \mid n, s) = P^{[\gamma_\rho;\nu]}(x \mid n', s)$. In such a situation the second premise will not
hold. This is similar to the situation with the first premise; in both an adversarially poor match
between the learning algorithm and the learning problem at hand confounds our premise.

29 If it weren’t for the second premise, we would have to work with P(r | n) rather than 
P(r | n, s) in evaluating ambiguities. This would then require specifying a prior P(s), reflecting 
“prior beliefs” of what the private utility is likely to be, among other aspects of s. Specifying 
a prior over such a space and then integrating against it can be a fraught exercise. In essence, 
the second premise allows us to circumvent this when averaging over n, by setting that prior 
to a delta function about the actual s. Nonetheless, it is important to note that we do not 
need a hypothesis as powerful as the second premise to do this; the second premise is only 
used once, in the proof of Coroll. 3 below, and a significantly weaker version of it would suffice 
there. We present the “powerful” version instead for pedagogical clarity. 

30 Formally, our presumption is that V z a € s a , z b € s b , <T~ g (z a ) — a- g {z b ). 
Let $V$ be a utility function, so that $N_{\rho,V}$ is as well. Then for both $s = s^a$ and
$s = s^b$, by using Coroll. 2 with $\Omega = \rho$ and $\eta = \emptyset$, we establish that

$$E(N_{\rho,V} \mid s) = \int dr\, dn\, P(r \mid s^{ab})\, P(n \mid r, s)\, E^{[\gamma_\rho;\nu]}(N_{\rho,V} \mid n, r, s).$$

So by Coroll. 1, taking $\lambda = \nu \cap \sigma$, $\kappa = \rho$, and $\psi^a = \psi^b = \gamma_\rho$, if separately for
each $r$ for which $P(r \mid s^{ab})$ is substantial,

$$A(V(\cdot,r), \gamma_\rho; n^a, s^a, x^1, x^2) < A(V(\cdot,r), \gamma_\rho; n^b, s^b, x^1, x^2),$$

(for all $(x^1, x^2)$ pairs, and for all $(n^a, n^b)$ such that both $P(n^a \mid r, s^a)$ and
$P(n^b \mid r, s^b)$ are substantial) we can conclude that $E(N_{\rho,V} \mid s^a) > E(N_{\rho,V} \mid s^b)$.
This approach can be used even if the coordinate utility V is factored with 
respect to G but the private utility is not. Note also that if we take V — 
and have be factored with respect to s b, then our reasoning implies that 

The first two premises can also be used to analyze the effect on agent $\rho$ of
changes to the other agents. In addition they can be used to analyze changes
that amount to a complete redefinition of the agent (changes that we can
implement by inserting commands in the value of the agent's worldview that
change how it behaves), or more generally, a coordinate transformation [22].
Indeed, by those premises, $\gamma_\rho$, $H$, and $P(r \mid n, s)$ parameterize $P(x \mid n)$. In
particular, say $\sigma = \sigma_\rho \subset \nu$, $H$ has no direct dependence on $n$ not arising in the
ambiguities, and we take $P(r \mid s)$ to be uniform. Then for fixed $H$, all aspects of
the learning algorithm are set by $\gamma_\rho$, $P(n \mid r, s)$, and the associated ambiguities.

More generally, once we specify P(r | s) in addition to these quantities, we 
have made all the choices available to us as designers that affect term 3 of the 
central equation. In principle, this allows us to solve for the optimal one of those 
four quantities given the others. For example, for fixed $\gamma_\rho$, $H$, and $P(r \mid s)$, we
could solve for which P(n | r, s) out of a class of candidate such likelihoods 
optimizes expected intelligence . 31 

The rest of this paper presents a few preliminary examples of such an ap- 
proach, concentrating on changes to s that only alter one or more agents’ private 
utilities, where only very broad assumptions about P(n \ r, s ) are used. These 
are the scenarios in which the premises have been most thoroughly investigated, 
and therefore in which confidence that $H$ etc. do indeed capture the totality of
a learning algorithm is highest. 

(vi) The third premise 

As just illustrated, for some differences in s (namely those that only modify 
private utilities) , we can simplify the analysis to involve only a single s-induced 

31 More formally, where $\sigma_\nu \subset \sigma$ sets the likelihood $P(n \mid r, s_\rho, s_\nu)$, we could solve for the
$s_\nu$ optimizing expected intelligence.


distribution over $r$'s (namely $P(r \mid s^{ab})$). The analysis still involved different
distributions over $n$'s however, one for each of the two $s$'s (in the guise of the two
distributions $P(n \mid r, s)$). Moreover, to calculate expected intelligence for a given
$s$ we must average over $n$, and usually changes to $s$ change $P(n \mid r, s)$ in a way
that is difficult to predict. 32 Therefore to exploit the first two premises to determine
which of the two $s$'s gave better expected intelligence, we had to have a desired
difference in ambiguities hold for all pairs of $n$'s generated from the two $s$'s, an
extremely restrictive condition.

One way around this would be to extend the analysis in a way that only 
involves a single s-induced distribution over n’s. To see how we might do this, 
fix $r$, $x^1$, and $x^2$, and consider a pair $s^a$ and $s^b$ that differ only in the
associated private utility for agent $\rho$, where those two utilities are mutually factored.
Train on $g_{s^b}$, thereby generating an $n$ according to $P(n \mid r, s^b)$, and thence a
distribution over $r'$, $P(r' \mid n)$, which in turn gives an ambiguity between values
of the private utility at $x^1$ and $x^2$ and therefore an expected intelligence. Our
choice of private utility affects this process in three ways:


1) By affecting the likely $n$, and therefore $P(r' \mid n)$.

2) By affecting how well distinguished utility values at $x^1$ and $x^2$ are for any
associated pair of $r'$ values generated from $P(r' \mid n)$. If $P(r' \mid n)$ is broad and/or
the private utility is poor at distinguishing $x^1$ and $x^2$, then ambiguity will be
poor.

3) By providing one of the arguments to $H$ which (given the utility, and
along with the ambiguities of (2)) fixes the distribution over intelligences.


Combining the first premise, in the guise of Coroll. 1 (with $\lambda = \nu$, $\kappa = \Omega = \rho$,
$\psi^a = g_{s^a} = V^a$, and $\psi^b = g_{s^b} = V^b$), with the second premise (in the guise of
Coroll. 2, with $\Omega = \rho$), we see that
the first two premises concern the last two effects of the choice of private utility
on expected intelligence. They say nothing about the first effect of the private
utility choice though.

It is typically the case that the first effect will tend to work in a correlated
manner with the last two effects. That is, if for some given $n$ generated from
$g_{\rho,s^b}$ the utility $g_{\rho,s^a}$ results in higher intelligences (e.g., because it is better
able to distinguish utility values than is $g_{\rho,s^b}$), it is typically also the case that
if one had used $g_{\rho,s^a}$ to generate $n$'s in the first place, it would have resulted in
more informative $n$, and therefore $P(r' \mid n)$ would have been crisper, leading to
a better ambiguity and thence expected intelligence.

We formalize this as the third premise. 33 


32 For example, in a multi-stage game (see App. D), in general changing $s$ causes our agent
to take different actions at each stage of the game, which usually then causes the behavior of
the other agents at later stages to change, which in turn changes $\rho$'s training data, contained
in the value of $n$ at those later stages.

33 An alternative to the version of the third premise presented here that would serve our 
purposes just as well would have all distributions conditioned on some b E /3 C cr (e.g., (r, s)), 
rather than just on s. One could also modify the hypothesis condition of the third premise by 
Say that $s^a$ and $s^b$ differ only in their associated private utilities, and that those
utilities are mutually factored. Then

$$\int dn\, P(n \mid s^b)\, E^{[g_{s^a};\nu]}(N_\rho \mid n, s^b) > \int dn\, P(n \mid s^b)\, E^{[g_{s^b};\nu]}(N_\rho \mid n, s^b)$$

$$\Rightarrow \quad \int dn\, P(n \mid s^a)\, E^{[g_{s^a};\nu]}(N_\rho \mid n, s^a) > \int dn\, P(n \mid s^b)\, E^{[g_{s^b};\nu]}(N_\rho \mid n, s^b).$$

Together with Coroll. 2 this results in the following:

Corollary 3 Say $s^a$ and $s^b$ differ only in the associated private utility for agent
$\rho$, and that those utilities are mutually factored. Then

$$\int dn\, dr\, P(r \mid s^b)\, P(n \mid r, s^b)\, E^{[g_{s^a};\nu]}(N_\rho \mid n, r, s^b) > \int dn\, dr\, P(r \mid s^b)\, P(n \mid r, s^b)\, E^{[g_{s^b};\nu]}(N_\rho \mid n, r, s^b)$$

$$\Rightarrow \quad E(N_{\rho,g_{s^a}} \mid s^a) > E(N_{\rho,g_{s^b}} \mid s^b).$$

If, $\forall r$, $A(g_{\rho,s^b}(\cdot,r), g_{\rho,s^b}; n, x^1, x^2, s^b) > A(g_{\rho,s^b}(\cdot,r), g_{\rho,s^a}; n, x^1, x^2, s^b)$ (for all
$(x^1, x^2)$, and for all $n$ such that $P(n \mid r, s^b)$ is substantial), then by Coroll. 1 the
condition in Coroll. 3 is met (take $\lambda = \nu \cap \sigma$ and $\kappa = \rho$, as usual). So by Coroll.
3, in such a situation we can conclude that $E(N_{\rho,g_{s^a}} \mid s^a) > E(N_{\rho,g_{s^b}} \mid s^b)$, i.e.,
that for fixed $r$, $s^a$ has better term 3 of the central equation than does $s^b$. This
is the process that will be the central concern of the rest of this paper: inducing
improved ambiguity, and then plugging the first premise (in the guise of Coroll.
1) into the second and third premises (combined in Coroll. 3) to infer improved
expected intelligence.

In particular, again consider the situation (discussed in the subsection on
the first premise) where $P(r \mid s^a) = P(r \mid s^b) \equiv P(r \mid s^{ab})$, and assume this also
equals $P(r \mid s^b)$. If separately for each $r$ for which $P(r \mid s^{ab})$ is substantial, and
for all associated $n$ for which $P(n \mid r, s^{ab})$ is substantial,

$$A(g_{\rho,s^b}(\cdot,r), g_{\rho,s^b}; n, x^1, x^2, s^b) > A(g_{\rho,s^b}(\cdot,r), g_{\rho,s^a}; n, x^1, x^2, s^b),$$

then we can conclude that

$$E(N_{\rho,g_{s^a}} \mid s^a) > E(N_{\rho,g_{s^b}} \mid s^b).$$

replacing $s^b$ throughout with some alternative $s^*$, and our results would still hold under the
substitution throughout of $s^b \rightarrow s^*$. Similarly one could change the integration variable $n \in \nu$
to some other coordinate $l \in \lambda \subset \nu$. For all such changes the results presented below (and
in particular Coroll. 3) would still hold; the important thing for those results is that each
ambiguity arising in the integrand of the left-hand-side of the hypothesis condition of the third
premise is evaluated with the same distribution over $r^1$ and $r^2$ as the corresponding ambiguity
in the right-hand-side. For pedagogical clarity though, no such modification is considered here.
Of course, in practice this condition won't hold for all such $r$ and $n$. At the same
time, Coroll. 3 makes clear that it doesn't need to; we just need the associated
integrals over $r$ and $n$ to favor $s^a$ over $s^b$.
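
The point that only the integrals need to favor $s^a$ can be illustrated with a small discrete example. The sketch below is not from the paper: the cell weights and conditional expected intelligences are hypothetical numbers, and the sums stand in for the integrals over $r$ and $n$ in the hypothesis of Coroll. 3, with both sides averaged against the same weights $P(r \mid s^b) P(n \mid r, s^b)$.

```python
def expected_intelligence(weights, conditional):
    """Average conditional expected intelligences over (r, n) cells."""
    return sum(weights[cell] * conditional[cell] for cell in weights)

# Hypothetical joint weights P(r | s^b) P(n | r, s^b) over four (r, n) cells.
weights = {("r1", "n1"): 0.4, ("r1", "n2"): 0.1,
           ("r2", "n1"): 0.3, ("r2", "n2"): 0.2}

# Hypothetical conditional expected intelligences under each private utility.
with_g_a = {("r1", "n1"): 0.80, ("r1", "n2"): 0.40,
            ("r2", "n1"): 0.75, ("r2", "n2"): 0.70}
with_g_b = {("r1", "n1"): 0.60, ("r1", "n2"): 0.55,
            ("r2", "n1"): 0.65, ("r2", "n2"): 0.60}

lhs = expected_intelligence(weights, with_g_a)
rhs = expected_intelligence(weights, with_g_b)

# Cells where the pointwise comparison goes the "wrong" way.
exceptions = [cell for cell in weights if with_g_a[cell] < with_g_b[cell]]
print(lhs > rhs, exceptions)
```

In this toy example the pointwise comparison fails in one low-weight cell, yet the weighted average still favors the utility labeled `g_a`.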

(vii) Example: The collapsed utility 

As an example of how to use Coroll. 3, consider the use of a Boltzmann learning
algorithm for our agent [25], where $s^b$ is our original $s$ value. With such an
algorithm, constructing a new private utility by scaling the original one (i.e.,
changing $s$) is equivalent to modifying the learning algorithm's temperature
parameter. Now say that for any pair of moves, the ambiguity for $s^b$ and any probable
associated worldview $n^b$ is zero for all negative $y$ values. Then changing $s$ by
lowering the temperature will monotonically lower $A(g_{\rho,s^b}(\cdot, r), g_{\rho,s}; n^b, x^1, x^2)$.
Accordingly, doing this cannot lower expected intelligence, only increase it. (Note
that the new private utility is factored with respect to the original one, so this
effect of changing $s$ also holds for expected intelligence with respect to the original
private utility.)
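
The equivalence between scaling the private utility and changing the temperature can be seen directly in a Boltzmann (softmax) move distribution. The following is a minimal sketch, assuming the learner picks moves with probability proportional to $\exp(u/T)$ for estimated utilities $u$; the function name `boltzmann` and all numerical values are hypothetical.

```python
import math

def boltzmann(utilities, temperature):
    """Boltzmann (softmax) move probabilities for estimated utilities."""
    scaled = [u / temperature for u in utilities]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(v - shift) for v in scaled]
    total = sum(weights)
    return [w / total for w in weights]

utilities = [0.2, 0.5, 0.9]  # hypothetical utility estimates for three moves
K, T = 4.0, 1.0              # hypothetical scale factor and temperature

# Scaling the utility by K at temperature T ...
scaled_utility = boltzmann([K * u for u in utilities], T)
# ... gives the same distribution as the original utility at temperature T / K.
lower_temperature = boltzmann(utilities, T / K)
print(scaled_utility, lower_temperature)
```

Under this assumption, multiplying the utility by $K$ and dividing the temperature by $K$ are interchangeable.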

Now consider the following theorem: 

Theorem 5 Fix $n$, $s^a$, $s^b$, $r \in \mathrm{supp}\, P(\cdot \mid s^b)$ and a function $U : x \in \xi \rightarrow \mathbb{R}$.
Stipulate that

i) $\forall x, x' \in \xi$, $\mathrm{sgn}[U(x, r) - U(x', r)] = \mathrm{sgn}[g_{s^b}(x, r) - g_{s^b}(x', r)]$;

ii) $\forall r' \in \mathrm{supp}\, P(\cdot \mid n)$, there exist two real numbers $A_{r'}$ and $B_{r'} \leq A_{r'}$
such that $g_{s^b}(x, r')$ takes on both values, but no others, as one varies the
$x \in \xi$;

iii) for all such $r'$, $g_{s^a}(x, r') = 0$ if $A_{r'} = B_{r'}$, and equals
$\frac{g_{s^b}(x, r') - B_{r'}}{A_{r'} - B_{r'}}$ otherwise, and $\forall r' \in \mathrm{supp}\, P(\cdot \mid n)$, $g_{s^a}$ is factored with respect to $g_{s^b}$;

iv) for each pair of moves, for at least one move of that pair, $x^*$, $\exists\, y^*$ such
that $P(g_{s^a}(x^*, \rho) = y \mid n) = \delta(y - y^*)$.

Then $\forall x^1, x^2$, $A(U, g_{s^a}; n, x^1, x^2)$ has purely non-negative support.

(An analogous version of this result holds if instead we take $g_{s^a}(x, r') = 1$
whenever $A_{r'} = B_{r'}$.)

Condition (i) of Thm. 5 can be viewed as a weakened form of requiring that
$U$ and $g_{s^b}$ be factored. In particular, it trivially holds for $U = g_{s^b}$, or (due to
the fact that $g_{s^a}$ is a difference utility with lead utility $g_{s^b}$) $U = g_{s^a}$. Conditions
(ii) and (iii) mean that for each $r'$, the values of $g_{s^a}(x, r')$ as one varies $x$ are
those of $g_{s^b}$ "collapsed" to one of the two values 0 or 1. However for fixed $x$,
which of that pair of values equals $g_{s^a}(x, r')$ can differ from one $r'$ to the next.

There are many situations in which condition (ii) of Thm. 5 holds with
$g_{s^b} = G$. One example is a spin glass with $G$ given by the Hamiltonian. Another
is the simple spin system where $G(z) = \sin(\pi n(z)/2)$, $n(z)$ being defined as the
total number of spins in the up configuration.
Condition (iv) means that given worldview $n$, context $r$, and a pair of moves,
there is no room for uncertainty in the value of the private utility at $x^*$; it
must equal (the typically unknown value) $y^*$ there. (Note that which element
of the pair of moves is this special $x^*$ can vary with $n$ and/or $r$.) This will
often be the case if, for example, $n$ was generated from $g_{s^a}$, and the agent's
($n$-based) "prediction" for the utility value of the particular move it actually
ends up making is both unambiguous and correct. In particular, such prediction
accuracy often can be induced by having all the other agents readily "freeze"
into a static background. In turn, as an example, those other agents are likely
to freeze if they all use Boltzmann learning algorithms with their temperatures
set low enough, and with the windows they use to estimate the utilities of their
possible moves short enough.

We call the difference utility $g_{s^a}$ in Thm. 5 the collapsed utility (CU),
and say that it is formed by collapsing $g_{s^b}$, since for fixed $r'$ it is formed by
collapsing all the values $g_{s^b}(x, r')$ takes on as one varies $x$, to either 0 or 1.
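
The collapsing operation can be sketched in a few lines. This is an illustrative reading rather than the paper's definition: it assumes that, for fixed $r'$, the collapsed value is the affine rescaling $(g_{s^b} - B_{r'})/(A_{r'} - B_{r'})$ that sends the larger of the two values to 1 and the smaller to 0 (and returns 0 when the two coincide). The helper names `g_b` and `collapse`, and the toy spin system with $G = \sin(\pi n/2)$, are assumptions made for the example.

```python
import math

def g_b(x, r):
    """Hypothetical lead utility: sin(pi * n / 2), n = number of up spins."""
    n_up = x + sum(r)  # x is our agent's spin (0 or 1), r the other spins
    return round(math.sin(math.pi * n_up / 2.0), 12)

def collapse(utility, moves, r):
    """Collapse a two-valued utility to {0, 1} for a fixed context r."""
    values = {utility(x, r) for x in moves}
    if len(values) > 2:
        raise ValueError("utility takes more than two values for this context")
    A, B = max(values), min(values)
    if A == B:
        return {x: 0.0 for x in moves}
    # Affine map sending B to 0 and A to 1; it preserves the ordering of moves.
    return {x: (utility(x, r) - B) / (A - B) for x in moves}

moves = [0, 1]
for context in [(0, 0), (1, 0), (1, 1)]:
    print(context, collapse(g_b, moves, context))
```

Note that which move receives the value 1 changes with the context, as described above.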

When the conditions in Thm. 5 hold the ambiguity will shrink monotonically
as the CU is scaled upwards. As an example, consider a Boltzmann learning
algorithm in the scenario discussed at the end of the previous subsection, where
in addition the conditions in Thm. 5 are met for private utility set to the CU. As
the temperature parameter of that algorithm shrinks the associated expected
intelligence cannot decrease, and should in particular eventually exceed that of
$g_{s^b}$. 34 Therefore for the choice of $g_{s^b} = G$, the value of $G$ induced by using CU
as the private utility with a low enough temperature should be larger than that
induced by using the team game at any temperature.

3 The Aristocrat and Wonderful Life Utilities 

In this section we illustrate a general set of techniques for changing the pri- 
vate utility so as to monotonically lower unordered ambiguity conditioned on 
a particular n. As discussed above, when plugged into Coroll. 3 such improved 
ambiguities can cause the new private utility to have better expected intelligence 
than the original one. 

The analysis will be closely analogous to that behind the use of Fisher’s 
linear discriminant in statistics. We will start by restricting the analysis to 
distributions obeying a linearity condition. This is essentially an extended form 
of assuming Gaussian distributions — such an assumption being the starting 
point of the derivation of Fisher’s linear discriminant. We will then exploit 
Coroll. 3 to derive “learnability” as a measure of the quality of a private utility 
(as far as term 3 in the central equation is concerned). Formally, learnability 

34 Formally, the fact that ambiguity for $g_{s^a}$ has purely non-negative support does not mean
that the ambiguity for $g_{s^b}$ has a support that extends to negative values. In practice though,
that is the case for the vast majority of $n \in \mathrm{supp}\, P(\cdot \mid s^b)$. Even so, we cannot conclude that
the ambiguity function for $g_{s^a}$, extending over all $y$, is less than that for $g_{s^b}$. We can conclude
that the reverse does not hold though. And again, in practice, the discrepancy in supports
usually does mean that the ambiguity function for $g_{s^a}$ is less than that for $g_{s^b}$, so that we
can apply the first two premises.
is identical to the Rayleigh coefficient, just expressed in a different setting.
Completing the analogy, whereas with the Fisher discriminant one strives for
coordinate transformations of a data set giving a large value of the associated
Rayleigh coefficient, at the end of this section we demonstrate transformations
to the private utility giving a large value of the associated learnability.
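
For readers less familiar with the statistical side of the analogy, the Rayleigh coefficient of a projection direction $w$ for two classes is the squared separation of the projected class means divided by the summed projected within-class variance. The sketch below uses that standard textbook form with hypothetical two-dimensional class statistics; the function name `rayleigh_coefficient` is an assumption made here, not notation from this paper.

```python
def rayleigh_coefficient(w, mean1, mean2, cov1, cov2):
    """Squared projected mean separation over projected within-class variance."""
    dim = len(w)
    mean_gap = sum(w[i] * (mean1[i] - mean2[i]) for i in range(dim))
    within = sum(w[i] * (cov1[i][j] + cov2[i][j]) * w[j]
                 for i in range(dim) for j in range(dim))
    return mean_gap ** 2 / within

# Hypothetical class means and covariances.
mean1, mean2 = [1.0, 0.0], [0.0, 0.0]
cov = [[1.0, 0.0], [0.0, 4.0]]

for direction in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]):
    print(direction, rayleigh_coefficient(direction, mean1, mean2, cov, cov))
```

The coefficient is a signal-to-noise ratio that is unchanged when $w$ is rescaled, which is the property the learnability defined below shares with respect to rescaling the utility.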


(i) Learnability 

We begin by considering the first order expansion of the distribution of one
utility in terms of the distribution of another utility:

Theorem 6 Fix $l, l' \in \lambda \subset \nu$, an $x$-ordering $U$, and two utilities $V_a$ and
$V_b$, where $\exists\, K \in \mathbb{R}^+$ and $h : \xi \rightarrow \mathbb{R}$ such that

$$P_{V_a}(y^1, y^2; l', x^1, x^2) = P_{K V_b + h}(y^1, y^2; l, x^1, x^2).$$

Then $\forall y$,

$$A[y; U, V_a; l', x^1, x^2] = A\left[\frac{y}{K} + t_U(x^1, x^2)\left(\frac{h(x^2) - h(x^1)}{K}\right); U, V_b; l, x^1, x^2\right].$$

So if in addition to the condition in Thm. 6, $\forall y$,

$$A\left[\frac{y}{K} + t_U(x^1, x^2)\left(\frac{h(x^2) - h(x^1)}{K}\right); U, V_b; l, x^1, x^2\right] < A[y; U, V_b; l, x^1, x^2],$$

then it follows that $A[U, V_a; l', x^1, x^2] < A[U, V_b; l, x^1, x^2]$.
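
The change of argument in Thm. 6 can be checked on samples. The sketch below is a hedged illustration, not the paper's construction: it assumes the ambiguity is the CDF of $t_U(x^1, x^2)[V(x^1) - V(x^2)]$, takes $V_a = K V_b + h$ exactly (a special case of the distributional condition), and uses hypothetical Gaussian samples for $V_b$ at the two moves. All names and numbers are assumptions.

```python
import random

def empirical_cdf(samples, y):
    """Fraction of samples that are <= y."""
    return sum(1 for s in samples if s <= y) / len(samples)

random.seed(0)
K = 0.5                     # hypothetical scale factor
h1, h2 = 0.3, -0.1          # hypothetical offsets h(x^1), h(x^2)
t = 1                       # ordering sign t_U(x^1, x^2)

# Hypothetical samples of (V_b(x^1), V_b(x^2)).
pairs_b = [(random.gauss(1.0, 1.0), random.gauss(0.6, 1.0)) for _ in range(20000)]
diff_b = [t * (v1 - v2) for v1, v2 in pairs_b]
diff_a = [t * ((K * v1 + h1) - (K * v2 + h2)) for v1, v2 in pairs_b]

def ambiguity_a(y):
    """Ambiguity of V_a evaluated directly."""
    return empirical_cdf(diff_a, y)

def ambiguity_b_shifted(y):
    """Ambiguity of V_b at the rescaled and shifted argument of Thm. 6."""
    return empirical_cdf(diff_b, y / K + t * (h2 - h1) / K)

for y in (-1.0, 0.0, 0.5, 1.0):
    print(y, ambiguity_a(y), ambiguity_b_shifted(y))
```

With these assumptions the two columns agree, which is the content of the identity for this special case.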

We will sometimes find it convenient to put subscripts on $K$ and/or $h$
explicitly giving the values of $l'$, $V_a$, $l$, $V_b$, $x^1$ and/or $x^2$, in that order. For example,
in Fig. 2 we refer to $K_{V,U}$ to mean $K$ when $V_a = V$ and $V_b = U$. 35

It is often the case that "to first order", changing from $V = V_b$ to $V = V_a$
doesn't change the shapes of any of the associated distribution functions
$P(V(x) = v \mid l)$ (one such distribution for each $x$). Primarily, all the change
does to those distributions is separately shift them, and/or contract them all
by the same factor. 36,37 The condition in Thm. 6 is (a slightly weaker version

35 Note the following algebraic rules concerning such sets of distributions that are linearly
related:

$$K_{l_1,V_1,l_3,V_3} = K_{l_1,V_1,l_2,V_2}\, K_{l_2,V_2,l_3,V_3};$$

$$K_{l_1,V_1,l_2,V_2} = 1 / K_{l_2,V_2,l_1,V_1};$$

$$h_{l_1,V_1,l_3,V_3} = K_{l_1,V_1,l_2,V_2}\, h_{l_2,V_2,l_3,V_3} + h_{l_1,V_1,l_2,V_2};$$

$$h_{l_1,V_1,l_2,V_2} = -h_{l_2,V_2,l_1,V_1} / K_{l_2,V_2,l_1,V_1}.$$

36 This is particularly common in situations where there are extremely many possible V 
values, densely packed together. 

37 Note that a linear relationship between utilities is a sufficient but not necessary condition 
for a linear relationship between the distributions of their values. 
of) the requirement that this property holds exactly, even if we also switch
from $l$ to $l'$ at the same time (and therefore change the underlying probability
distribution over $z$). The general effects of expansion or contraction of the utility
on the associated ambiguity are illustrated in Fig. 2.

Thm. 6 tells us in particular that when its condition is met along with the one
mentioned just following its presentation, then for $K = 1$ and $t_U(x^1, x^2)[h(x^2) - h(x^1)]$
negative, changing from $(V_b, l)$ to $(V_a, l')$ improves ambiguity. Moreover,
the degree of that drop grows with increasing magnitude of
$(h(x^2) - h(x^1))/K$. 38 In the usual way, for $l = l'$, $\lambda = \nu \cap \sigma$, $V_a = g_{s'}$ and $V_b = g_s$, where
$s$ and $s'$ only differ in their private utilities, we can exploit this phenomenon in
concert with Coroll. 1 and then Coroll. 3 to improve term 3. To that end we
start with the following:

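Theorem 7 Say that the condition in Thm. 6 holds for the quadruple $(l', V_a, l, V_b)$
with the same $K, h$ $\forall x^1, x^2$. Then

i) where $f$ is any distribution over $x$,

$$K = \sqrt{\frac{\int dx\, f(x)\, \mathrm{Var}(V_a; l', x)}{\int dx\, f(x)\, \mathrm{Var}(V_b; l, x)}}.$$

Now define

$$\Lambda_f(U; l'', x^1, x^2) \equiv \frac{E(U; l'', x^1) - E(U; l'', x^2)}{\sqrt{E_f(\mathrm{Var}(U; l'', \xi))}},$$

where $E_f(\mathrm{Var}(U; l'', \xi)) \equiv \int dx\, f(x)\, \mathrm{Var}(U(x, \rho) \mid l'')$. Then

ii)

$$\frac{h(x^2) - h(x^1)}{K} \propto \Lambda_f(V_b; l, x^1, x^2) - \Lambda_f(V_a; l', x^1, x^2),$$

where the $V_a$-independent proportionality constant is $\sqrt{\int dx\, f(x)\, \mathrm{Var}(V_b; l, x)}$.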

We call $\frac{h(x^2) - h(x^1)}{K}$ the (ambiguity) shift and $\Lambda_f(U; l, x^1, x^2)$ the
learnability of $U$ for $x^1$, $x^2$, and $l$. 39 As a particular example, for
$f(x) = (1/2)[\delta(x - x^1) + \delta(x - x^2)]$,

$$[\Lambda_f(U; l'', x^1, x^2)]^2 = 2\, \frac{[E(U; l'', x^1) - E(U; l'', x^2)]^2}{\mathrm{Var}(U; l'', x^1) + \mathrm{Var}(U; l'', x^2)}. \qquad (2)$$

Note that $|\Lambda_f(U; l'', x^1, x^2)|$ is invariant under affine transformations of $U$.
Typically we are interested in the case where
$\mathrm{sgn}[E(V_b^1 - V_b^2; l, x^1, x^2)] = \mathrm{sgn}[E(V_a^1 - V_a^2; l', x^1, x^2)] = t_{V_b}(x^1, x^2)$,
so that we can use learnability to evaluate the offset
term in Thm. 6, $t_U(x^1, x^2)\left(\frac{h(x^2) - h(x^1)}{K}\right)$.

38 A similar result holds if we instead consider a fixed pair $(x^1, x^2)$ and associated $K_{x^1,x^2}$,
so that the expansion factor can vary with moves, just like the offset factor $h$.

39 This latter is a slight modification from the definition used in our previous work.

Intuitively, the learnability of $U$ reflects its signal-to-noise, as far as agent $\rho$
is concerned, in that agent's process of "choosing its move". This is because the
numerator term in the definition of learnability reflects how much (the expectation
of) that utility varies as one changes the agent's move $x$ with the context
held fixed. In contrast, the denominator term reflects the (average over $x$ of)
how much $U$ varies due to uncertainty in the context while keeping the move $x$
fixed. 40
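
The signal-to-noise reading of learnability, and its link to the ambiguity shift in Thm. 7(ii), can be worked through numerically. The sketch below assumes $f$ puts weight 1/2 on each of $x^1$ and $x^2$, so that the denominator is the square root of the average of the two variances, and assumes $V_a = K V_b + h$ so that means and variances transform as $K m + h$ and $K^2 v$. The moments and the name `learnability` are hypothetical.

```python
import math

def learnability(mean1, mean2, var1, var2):
    """Learnability for f = (1/2)[delta(x - x^1) + delta(x - x^2)]."""
    return (mean1 - mean2) / math.sqrt(0.5 * (var1 + var2))

# Hypothetical means and variances of V_b at the two moves.
mb1, mb2, vb1, vb2 = 1.0, 0.6, 0.8, 1.2
K, h1, h2 = 0.5, 0.3, -0.1  # hypothetical scale and offsets h(x^1), h(x^2)

# Moments of V_a = K * V_b + h at the two moves.
ma1, ma2 = K * mb1 + h1, K * mb2 + h2
va1, va2 = K * K * vb1, K * K * vb2

lam_b = learnability(mb1, mb2, vb1, vb2)
lam_a = learnability(ma1, ma2, va1, va2)
sigma_b = math.sqrt(0.5 * (vb1 + vb2))  # proportionality constant in Thm. 7(ii)
shift = (h2 - h1) / K                   # the ambiguity shift

print(lam_b, lam_a, shift, sigma_b * (lam_b - lam_a))
```

In this example the offset $h$ raises the learnability from 0.4 to 1.2, and the shift equals the learnability difference times the constant, as Thm. 7(ii) states.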

The following results provide a geometric perspective on the expressions in 
Thm. 7: 

Theorem 8 Say that the condition in Thm. 6 holds for the quadruple $(l', V_a, l, V_b)$.

i) If both $V_a$ and $V_b$ are difference utilities with the same lead utility and $\beta = 1$,
while both $P(r'; l) = P(r'; l')$ and $\Lambda_f(V_b; l, x^1, x^2) < \Lambda_f(V_a; l', x^1, x^2)$,
then $K < 1$.

ii) Let $\{V_a, l'\}$ be an equivalence class of $(V, l)$ pairs all related to $(V_b, l)$ as
in Thm. 6. Then the learnability of those pairs multiplied by $t_U(x^1, x^2)$
is a shrinking function of the value of the associated ambiguities at the
origin. In addition, across all pairs in that class that share some particular
learnability value, $K$ is inversely proportional to the slope of the ambiguity
of that pair at the origin.

iii) Say the condition (ii) also holds for the quadruple $(l^*, V_a^* = \beta V_a, l, V_b)$
(though potentially for a different $K$ and/or $h$), where $P(r'; l^*) = P(r'; l')$.
Then $\Lambda_f(V_a^*; l^*, x^1, x^2)$ and $\Lambda_f(V_a; l', x^1, x^2)$ are identical $\forall x^1, x^2$, as are
the associated shifts, while $K_{l^*,V_a^*,l,V_b} = \beta K_{l',V_a,l,V_b}$. 41

iv) If $K < 1$ and $\Lambda_f(V_a; l', x^1, x^2) > \Lambda_f(V_b; l, x^1, x^2)$ ($K > 1$ and
$\Lambda_f(V_a; l', x^1, x^2) < \Lambda_f(V_b; l, x^1, x^2)$, respectively), then the maximal slope of $A(V_a; l', x^1, x^2)$ is
greater than (less than, respectively) the maximal slope of $A(V_b; l, x^1, x^2)$.

To understand Thm. 7 in terms of ambiguities, for pedagogical simplicity
consider making changes to a utility $V$ without any corresponding changes to
the value of $\lambda$ (and therefore none to the underlying probability distribution
over $z$). First note that such a change applied to the scale of $V$ doesn't change
how weighted the associated ambiguity is to positive $y$ values. It doesn't change
"how far" $V(x^1) - V(x^2)$ is from zero, on average. This "weight to positive
$y$ values" is reflected in the value of $|\Lambda_f|$ (which is invariant with respect to
such rescalings), and therefore (by Thm. 7(ii)) is also reflected in the value of

40 Low learnability is not only a problem for agents with poor learning algorithms. Even for a
Bayes-optimal learning algorithm, if the "signal to noise" of the private utility is poor, then the
agent's intelligence for the actual $r$ at hand can readily be far less than 1. (Bayes-optimality
only means that $x$ is set to maximize $E(g_s \mid n, x)$, not to maximize $g_s(x, r)$.)

41 Trivially, the condition in Thm. 6 holds for $(l', V_a^*, l, V_b)$ if it does for $(l', V_a, l, V_b)$. In
addition, $\Lambda_f(V_a^*; l', x^1, x^2) = \Lambda_f(V_a; l', x^1, x^2)$ while $K_{l',V_a^*,l,V_b} = \beta K_{l',V_a,l,V_b}$.
Figure 2: The leftmost solid line shows an ambiguity $A(y; V; l, x^1, x^2)$. The
dotted line shows $A(y; V'; l, x^1, x^2)$ for $V' = aV$, $0 < a < 1$. $K_{V',V} = a$,
and learnability of $V'$ is the same as $V$'s. The dashed line shows the dotted
line right-shifted by $t_U(x^1, x^2)[h(x^1) - h(x^2)] > 0$, i.e., the ambiguity
$A(y; U; l, x^1, x^2)$ for $U = aV + h$. (Since we have not changed $s$, Thm. 6 must
apply.) $\Lambda_f(U; l, x^1, x^2) > \Lambda_f(V'; l, x^1, x^2)$. Finally, the rightmost solid line
depicts the dotted line expanded back to the scale of the leftmost solid line, i.e.,
the ambiguity of $U^* = \beta U$ where $\beta = 1/K_{V',V}$, so that $K_{U^*,V} = 1$. As with the
previous one, this rescaling from $U$ to $U^*$ does not affect the learnability.


$t_U(x^1, x^2)\left(\frac{h(x^2) - h(x^1)}{K}\right)$. However such a rescaling can still be useful in how it
"stretches" the CDF. To see how, note by Thm. 8(iii) that if $V$ has better
learnability than some other utility $U$, such stretching of $V$ may provide a new utility
$V^*$ such that in addition $K_{V^*,U} = 1$, which means that $V^*$ has better ambiguity
than $U$ (in light of Thm. 8(iii)). 42 In other words, to change the learnability
we must induce a rightward offset in the (potentially scaled) ambiguity of $V$.
Having done that, a subsequent rescaling can give us an aggregate $K$ equal to
1 (without changing learnability), and thereby provide a final utility whose
ambiguity lies everywhere below that of $U$. The value of that offset is given by the
($\beta$-independent) ambiguity shift. (See Fig. 2.)


(ii) Learnability and term 3 

Plug Thm. 7 into Thm. 6, with $U$ in Thm. 6 set to the $x$-ordering given by
$V_b(\cdot, r)$. This shows that after appropriate rescaling of $V_a$, the pair $(V_a, l')$ has
better ambiguity than does $(V_b, l)$ if it has better learnability. 43 If we plug that
fact into Coroll. 1, we establish the following:

Corollary 4 Fix $r$, $l$, $l'$, $V_a$ and $V_b$, where $\lambda \subset \nu$, as usual. Say $\exists\, K \in \mathbb{R}^+$,
$h : \xi \rightarrow \mathbb{R}$, such that $\forall x^1, x^2$

i) $P_{V_a}(y^1, y^2; l', x^1, x^2) = P_{K V_b + h}(y^1, y^2; l, x^1, x^2)$;

and

ii) $t_{V_b(\cdot,r)}(x^1, x^2)\, \Lambda_f(V_a; l', x^1, x^2) > t_{V_b(\cdot,r)}(x^1, x^2)\, \Lambda_f(V_b; l, x^1, x^2)$.

Then by appropriately rescaling $V_a$ we can assure that

$$E^{[V_a;\lambda]}(N_{\rho,V_b} \mid r, l') > E^{[V_b;\lambda]}(N_{\rho,V_b} \mid r, l).$$

42 Note that such rescaling amounts to changing the temperature parameter in a Boltzmann
learning algorithm.

43 Note that this rescaling is done before we invoke the third premise. In this way we will be
able to exploit that premise to do rescaling without invoking the assumption in Thm. 8(iii).

Consider changing the private utility from $V_b$ to a $V_a$ which is factored with
respect to $V_b$. Then Coroll. 4 means that if this increases the learnability (in the
$x$-ordering preferred by $V_b(\cdot, r)$) of one's private utility, then typically it results
in higher expected intelligence, for the optimal scaling of that private utility.
More precisely, express Coroll. 4 for $\lambda = \nu \cap \sigma$ and $l = l' = (n, s)$ and then
plug it into Coroll. 3 with $s^b = s$, $g_{s^a} = V_a$ and $g_{s^b} = V_b$, where $s^a$ and $s^b$ differ
only in the associated private utility for our agent, and $V_a$ and $V_b$ are mutually
factored. Then we see that if learnability is higher with $s^a$ than with $s^b$ (in the
$x$-ordering preferred by $V_b(\cdot, r)$) for enough of the $n$ for which $P(n \mid r, s^b)$ is
non-negligible, then $s = s^a$ gives a higher expected intelligence conditioned on
$r$ and $s$ than does $s = s^b$ (each intelligence evaluated for the associated optimal
scale of the private utility).

As an added bonus, often the higher the learnability of a private utility,
the more "slack" there is in setting the parameters of the associated learning
algorithm while still having an ambiguity that is below that of some benchmark,
low-learnability private utility. In other words, the higher the learnability, the
less careful one must be in setting such parameters in order to achieve expected
intelligence above some threshold. In particular, the greater the ambiguity shift
in Coroll. 4, the broader the range of scales $\beta$ for which $\beta V_a$ has greater expected
intelligence than does $V_b$. So by using private utilities with increased learnability
often it becomes less crucial that one exactly optimize the learning algorithm's
internal parameter setting the scale at which the algorithm examines utility
values. This phenomenon can be amplified via "constructive interference", for
example as in the following result.

Corollary 5 Fix $r$ and two sets of utility-($\lambda$-value) pairs, $\{V_t, l_t\}$ and $\{V_{t^*}, l_{t^*}\}$,
indexed by $t$ and $t^*$, respectively. Assume all quintuples $(r, l_{t^*}, V_{t^*}, l_t, V_t)$ obey
Coroll. 4(i),(ii) with $V_a = V_{t^*}$, $V_b = V_t$, etc. For pedagogical simplicity, also take

$$\mathrm{sgn}[V_b(x^1, r) - V_b(x^2, r)] = \mathrm{sgn}[V_a(x^1, r) - V_a(x^2, r)] \equiv m,$$

$$\mathrm{sgn}[E(V_b^1 - V_b^2; l, x^1, x^2)] = \mathrm{sgn}[E(V_a^1 - V_a^2; l', x^1, x^2)] \equiv m',$$

and $m = m'$.

i) Define

$$\Delta_{t,t^*,x^1,x^2} \equiv \{\Lambda_f(V_{t^*}; l_{t^*}, x^1, x^2) - \Lambda_f(V_t; l_t, x^1, x^2)\} \sqrt{\int dx\, f(x)\, \mathrm{Var}(V_t; l_t, x)},$$

$$B_{t,x^1,x^2} \equiv \min(y : A(y; V_t(\cdot,r), V_t; l_t, x^1, x^2) = 1),$$

$$D_{t,x^1,x^2} \equiv \max(y : A(y; V_t(\cdot,r), V_t; l_t, x^1, x^2) = 0),$$

where as usual $f$ is a fixed but arbitrary distribution over $x$, and we assume
$\Delta_{t,t^*,x^1,x^2} > 0$ $\forall t, t^*, x^1, x^2$.
ii) Define $K_{t,t^*} \equiv K_{l_{t^*},V_{t^*},l_t,V_t}$ and then define the subintervals of $\mathbb{R}$ (one for
each $(t, x^1, x^2)$ triple)


and 


1 


J t,t* ,V* ja : 1 ,^: 2 — 


Bt>x x ,x 2 


D- 


t,x x ,x 2 


Kt,t* ^t,X X ,X 2 ,X 1 ,X 2 Dt,X X ,X 2 ,x x ,x 2 

if A,* 1 ,X 2 < ^t,£* jX 1 ,X 2 ) 


1 


■h 


^ijX 1 ,x 2 


-,oo) 


, OO) 


^t,X X ,X 2 H" ^t,t*,X X ,X 2 

if ~~ ^ty,x\x 2 < A,x\x 2 < 
f ^ A.X^X 2 v 

A,t* A,* 1 ,* 2 + A tjt ^, x i jX 2 

otherwise , 

— ,x 2 ,t* , V* ,x* ,x 2 * 


in) Define Lt~ y* =U t L ti t*y*. 

Then for every t*, Vp £ L t *y* } 

E l0V ‘’ X HN v . | r,/ t .) > I r, / t ). 


Note that > 0 always, since m = m! for (Z t , V t ). Accordingly, L t *y* is 

never empty, always containing U tj^~ at least. 44,45 

To help put Coroll. 5 in context, apply Coroll. 4 to the scenario of Coroll. 5. 
This estabhshes that for any t*, 3/3 6 A-,y* such that E^ v *' x \Npv I r, k~) > 
maxtE^ Vt]X ^ (Npv t I r, Z t ). Note also the immediate implication of Coroll. 5 that 
V/3 € C\t*L t ~y * , 

mint*E^ v ]X \Ny* | r, Z t *) > (JVy t | r, Z t ). 


As an example of Coroll. 5, take A = v fl cr, have Z f * equal some fixed 
1* Vt*, F* = g s + , and F = g St Vt. Have real- valued t e [h > 0, tf\, where V t = 
F tl “-. So assuming A/(F*; Z*,# 1 ,# 2 ) > A /(F; Z^x 1 ,# 2 ) Vx^x 2 as usual, the 
range in the logarithms of P for which E^ v *'' X \Nv* [ r , T) > mintE^ Vt]X ^ (Ny t | 
r, Z t ) is greater than or equal to ln(t 2 ) - ln(ti). 46 


44 If (unlike in Coroll. 4) the value of $K$ can change with the $(x^1, x^2)$ values, then those
indices must be added to $K$'s subscripts. In this case the conclusion of Coroll. 4 need not
hold; $L_{t^*,V^*}$ can be empty.

45 A subtle point is that in situations where $D_{t,x^1,x^2} > 0$, we can increase the scale of
$V_t$ as many times as we want and assuredly improve its ambiguity each time. (This is not
something we can do in the other situations.) Accordingly, if every instance going into
$L_{t^*,V^*}$ is such a situation, then our conclusion that rescaling $V^*$ can assuredly give
better expected intelligence than $V_t$ is a bit irrelevant; in this scenario we can also
rescale $V_t$ to assuredly improve its expected intelligence.

46 To see this, note that t sets the scale of Vt, just like ft does for V* . Furthermore, K t = 


Kt,f = if P(r'-,n 

contained in L t y y , equals 


= P(r']lt) Vr', t (cf. Thm. B(iii)). So 1 /K t , which we know is 
. Now apply Coroll. 5. 
As another example, choose {l t } = {Z t -} = {n 6 supp P(y | r, for

some set of a values {s* }, with V t = V n s i = g $i Vi. Also presume that V/?, there 
is a design coordinate value such that g s p — j3V*. If we now plug the conclu- 
sions of Coroll. 5 into Coroll. 3, we establish that Vi, (3 £ n n€supp p^jr, **)£»,** > 

| r, s'), 

and therefore V/3 E C s i )n £ SU p P p( J/ j r>5 t).L nj $i i v r *, 

|r,^)> TWinst t n€supp I ^ ) 


(iii) Aristocrat Utility 

In general, there is no utility that is both factored with respect to the world
utility and has infinite learnability. 47 The following result allows us to solve for
the private utility that maximizes learnability, and thereby find the private utility
for agent $\rho$ that should give best performance under the first three premises:

Theorem 9 


i) A utility $U_1$ is factored with respect to $U_2$ at $z$ iff $\forall z' \in \rho(z) \equiv r$,
with $z' = (x, r)$, $U_1(x, r) = F_r(U_2(z')) - D(r)$, for some function $D$ and some
$r$-parameterized function $F_r$ with positive derivative.

ii) For fixed $l \in \Lambda \subset \nu$, $r$, $x^1$, $x^2$, and $F$, the $D$ that maximizes
$\Lambda_f(U_1; l, x^1, x^2)$ is the $(l, x^1, x^2)$-independent quantity
$E_{f(x')}(F_r(U_2(x', r)))$.


Z ZZ J jL fie J Uilit wtc uoJt/otui/W* i/£w!ir£6Tv 1^2 


iW rr. 


argmin^r 


E f{x) {Vax(U 2 ;l,Q} 1 

£ /(ll) , /(x2) {Vax((Fi -F 2 )^ 1 -r-2);Z,e,^)}J ’ 


where the subscript on the denominator expectation indicates that both $x$'s
are averaged according to $f$, and the delta function there means that our
two $F$'s (one for each $x$) are evaluated at the same $r$.


A particularly important example of a function $F_r$ meeting the condition in
Thm. 9 is $F_r(U_2) = U_2$. This choice results in the difference utility $U_1$ that takes
$z = (x, r) \to U_2(x, r) - E_f(U_2(x', r))$. We call this the Aristocrat Utility (AU)

47 As an example of when having both conditions is impossible, take r E {r 1 , r 2 }, x € 
{x 1 ,! 2 }, and G^x 1 ,^) > G{x 2 while G(x 2 ,r 2 ) > G{xl ,r 2 ). Then by Thm. 1, we also 
must have ^(x 1 ,?- 1 ) > ^(x^r 1 ) and 7 p(x 2 ,r 2 ) > ^(x^r 2 ). Also assume that P(r';l) = 
Sir* —r) Vr,5, so P(U = u; l , x) = 6(u — U(x , r)) always. 

Define A e ^(x 1 , r 2 ) -^(x 1 , r 1 ), C = Jpi * 2 ’ r2 ) “TpC* 1 * ^ -^(x 2 , r 2 ), 

and D 3 t^Cx 1 , r 1 ) ~ 7 p(x 2 , r1 )* AH-5 + C + I? = 0, and both C > 0 and D > 0. 

Take /(x) = 1/2 for both x, so f dx/(x) Var(£7; r, s,x) = [A 2 + B 2 ]/ 4, which by convexity 
> [(A + B)/2)] 2 = [(C + r»)/2)f. In turn, [E^x 1 ) - £(Z7; 1, x 2 )] 2 - [(D - C)/2] 2 < 
[(C + D) J2)] 2 . Combining, by the definition of learnability we see that it is bounded above by 
1. QED. 
for $U_2$ at $z$, $AU_{U_2,f}(z)$, reflecting the fact that it is the difference between the
value of $U_2$ at the actual $z$ and the average such utility.
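
The definition of AU can be illustrated with a minimal sketch, assuming a finite
move set so that the expectation over $f$ becomes a weighted sum. The function
`toy_U2`, the move set and the distribution `f` below are hypothetical choices made
only for illustration; they are not taken from the chapter.

```python
def aristocrat_utility(U2, x, r, moves, f):
    """AU(x, r) = U2(x, r) - sum over x' of f(x') * U2(x', r)."""
    baseline = sum(f[m] * U2(m, r) for m in moves)
    return U2(x, r) - baseline


# Hypothetical world utility: a small move-dependent term plus a large
# context-only term.
def toy_U2(x, r):
    return x * r + 10 * r


moves = [0, 1, 2]
f = {0: 0.5, 1: 0.25, 2: 0.25}  # assumed averaging distribution over moves

au = aristocrat_utility(toy_U2, 2, 3, moves, f)
# The f-average of AU over the agent's moves vanishes by construction.
au_mean = sum(f[m] * aristocrat_utility(toy_U2, m, 3, moves, f) for m in moves)
```

In this toy case the context-only term `10 * r` cancels exactly, which is the sense
in which AU removes the part of the utility that the agent cannot influence.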

Say a particular choice of $f$, $f'$, results in conditions (i) and (ii) of Coroll. 4
being met with $V_b = U_2$ and $V_a = AU_{U_2,f'}$ for the choice of $\Lambda$ etc. discussed just
after the presentation of Coroll. 4. Then we know by that corollary that once
it is appropriately rescaled, using the AU for $U_2$ as $\rho$'s private utility results
in an expected intelligence that is larger than the expected intelligence
that arises from using $U_2$ as the private utility. (Note that $U_2$ and $AU_{U_2,f'}$
are mutually factored.) Moreover, by Thm. 9 any other difference utility that
obeys Coroll. 4(i),(ii) (in concert with $U_2$) must have worse ambiguity than does
$AU_{U_2,f'}$, and therefore worse expected intelligence. 48

To evaluate AU for some $G$ at some $z$ we must be able to list all $z' \in \rho(z)$.
This can be a major difficulty, for example if one cannot observe all degrees of
freedom of the system. Even if we can list all such $z'$, we must also be able to
calculate $G$ for all those $z'$, an often daunting task which simple observation of
the actual $G(z)$ at hand cannot fulfill (in contrast to the calculation needed
with a team game, for example).

Even when we cannot calculate an AU exactly though, we can often use
an approximate AU and thereby improve performance over a team game. For
example, in an iterated game, at timestep $t$, $r$ for a particular player $i$ reflects
the state of the other players it is confronting. In such a situation, by observing
$r$, often we can approximate $E_f(g_i(x', r))$ by an appropriate average of the value
of $g_i$ over those preceding iterations when the state of the other players was $r$,
with $f$ being the frequency distribution of moves made by $i$ in those iterations.
In particular, consider a “bake-off” tournament of a 2-player game in which each
player in the tournament plays one other player in each round, and keeps track
of who it has played in the past and with what move and resultant outcome.
In such a situation, the expectation value for player $i$ confronting player $j$ that
gives $AU_g$ can often be approximated by the average payoff of player $i$ over
those previous runs where $i$’s opponent was $j$.
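
The bake-off approximation just described can be sketched as follows. This is only
an illustration under stated assumptions: the class name `AUEstimator`, its methods,
and the fallback used when no history exists are hypothetical, and the opponent's
identity stands in for the context $r$.

```python
from collections import defaultdict


class AUEstimator:
    """Approximate AU for one player in a round-robin tournament."""

    def __init__(self):
        # opponent id -> payoffs this player received against that opponent
        self._payoffs = defaultdict(list)

    def record(self, opponent, payoff):
        self._payoffs[opponent].append(payoff)

    def approx_au(self, opponent, payoff):
        """Current payoff minus the average past payoff against this opponent."""
        history = self._payoffs[opponent]
        if not history:
            # Assumption: with no history there is no baseline to subtract.
            return payoff
        return payoff - sum(history) / len(history)


est = AUEstimator()
est.record("j", 3.0)
est.record("j", 1.0)
au_vs_j = est.approx_au("j", 4.0)   # baseline is the mean of 3.0 and 1.0
au_vs_k = est.approx_au("k", 4.0)   # no history against "k"
```

Because the average is taken over the moves the player actually made, the implied
$f$ is the empirical move frequency, as in the text.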

On the other hand, even when we can evaluate AU exactly, it may be that
the conditions in Coroll. 4 are badly violated. In such situations increasing
learnability by using AU will not necessarily improve expected intelligence, and
accordingly AU may not induce optimal performance. Indeed, it may induce 
worse performance than the team game in such situations. On the other hand, 
there are other modifications to the private utility that (under the first premise) 
may improve expected intelligence in these situations. An example of such a 
utility is the CU, as illustrated in [22]. 

(iv) Wonderful Life Utility 

One technique that will often circumvent the difficulties in evaluating AU is to
replace $\rho$ with a coarser partition, having poorer resolution. While this
replacement usually decreases learnability below that of AU, it still results in
utilities that are far more learnable than team game utilities, while (like team
games) not requiring knowledge of the set of worldpoints $\rho(z)$ in full. In this
subsection we illustrate making such a replacement for difference utilities.

48 Note though that in general there may be a utility $F_r(U_2) - D(r)$ with better
learnability than AU, for example if $F_r$ is non-linear. Note also that whether
$AU_{U_2,f'}$ obeys conditions 4(i)(ii) will depend on the choice of $f'$, in general.

We concentrate on the case where the domain of the lead utility $D_1$ is all of
$\zeta$, and the secondary utility $D_2 = D_1(\phi(z))$ for some function $\phi : \zeta \to \zeta$
where $\forall z \in C$, $\phi$ depends only on $r$, i.e., $\forall r$, $\forall z', z'' \in r$,
$\phi(z') = \phi(z'')$. So specifying the utility consists of choosing $\phi$. While in
general we can make the choice that best suits our purposes, here we will only
consider a particular class of $\phi$'s. A more general approach might, for example,
choose $\phi$ to maximize learnability. Intuitively, the resulting difference utility is
equivalent to subtracting $D_1$ of a transformed $z$ from the original $D_1(z)$, with the
transform chosen to maximize the signal-to-noise of the resultant function. See the
discussion of Thm. 7.

Let $\pi$ be a partition of $\zeta$. Fix some subset of $\zeta$ called the clamping element
$CL_\pi$ such that $\forall p \in \pi$, $D_1$ is invariant across the (assumed non-empty)
intersection of $CL_\pi$ and $p$. 49 Define an associated projection operator
$CL_\pi(z) \equiv CL_\pi \cap \pi(z)$, which for any $p \in \pi$ maps all worldpoints lying in $p$
to the same subregion of that element, a subregion having a constant $D_1$ value. 50
Then the Wonderful Life Utility (WLU) of $D_1$ and $\pi$ is defined by

$$WLU_{D_1,\pi}(z) \equiv D_1(z) - D_1(CL_\pi(z)).$$ 51
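
The WLU definition can be sketched in code as follows. This is a minimal
illustration, assuming worldpoints of the form $z = (x, r)$, a partition $\pi$ indexed
by $r$, and a clamping element that contains exactly one worldpoint per element of
$\pi$ (so the invariance requirement on $D_1$ holds automatically). The utility `D1`
and the clamp value are hypothetical.

```python
def wlu(D1, z, clamp):
    """WLU(z) = D1(z) - D1(clamp(z))."""
    return D1(z) - D1(clamp(z))


def clamp_move(z, null_move=0):
    """Projection onto the clamping element: fix the move, keep the context."""
    x, r = z
    return (null_move, r)


# Hypothetical lead utility with a large context-only term r**2.
def D1(z):
    x, r = z
    return x * r + r ** 2


value = wlu(D1, (2, 5), clamp_move)
```

Here the context-only term `r ** 2` is removed by the subtraction, leaving only the
part of `D1` that depends on the clamped degrees of freedom.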

To state our main theorem concerning WLU, for any partition of $\zeta$, $\pi$, and
any set $B \subseteq \zeta$, define $B \cap \pi$ to be a partition of $B$ with elements given by the
intersections of $B$ with the elements of $\pi$. Furthermore, recall from App. B that
given two partitions $\pi_1$ and $\pi_2$, $\pi_1 \subseteq \pi_2$ iff each element of $\pi_1$ is a subset
of an element of $\pi_2$. Then the following holds regardless of what subset of $\zeta$
forms $C$:

Theorem 10 Let $\pi$ and $\pi' \subseteq \pi$ be two partitions of $\zeta$. Then $WLU_{D_1,\pi}$ is
factored with respect to $D_1$ for coordinate $C \cap \pi'$ $\forall z \in C$.

As an example, with $\rho = C \cap \pi$, $WLU_{G,\pi}$ is factored with respect to $G$ for
coordinate $\rho$.
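
A brute-force check of this kind of factoredness can be sketched on a small finite
example. The sketch assumes the ordering characterization implied by Thm. 9(i): two
utilities are treated as factored if, within every context $r$, they rank any two
moves the same way. The utility `G`, the clamp to move 0, and the grids of moves and
contexts are all hypothetical.

```python
from itertools import product


def sign(a):
    return (a > 0) - (a < 0)


def is_factored(U1, U2, moves, contexts):
    """True if U1 and U2 order every pair of moves identically in each context."""
    for r in contexts:
        for x1, x2 in product(moves, repeat=2):
            if sign(U1(x1, r) - U1(x2, r)) != sign(U2(x1, r) - U2(x2, r)):
                return False
    return True


def G(x, r):                     # hypothetical world utility
    return x * r + r ** 2 - x ** 2


def wlu_G(x, r):                 # WLU of G, clamping the move to 0
    return G(x, r) - G(0, r)


def anti_G(x, r):                # a utility that is clearly not factored
    return -G(x, r)


grid = range(-2, 3)
wlu_ok = is_factored(wlu_G, G, grid, grid)
anti_ok = is_factored(anti_G, G, grid, grid)
```

The check passes for the WLU because the subtracted term depends only on the
context, so differences between moves within a context are unchanged.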

Note that $\pi' \subseteq \pi$ means that $\pi'$ is either identical to $\pi$ or a “finer-resolution”
version of $\pi$. So $z \to CL_\pi \cap \pi(z)$, by sending all points in $\pi(z)$ to the same
point, is a more severe operation, resulting in a greater loss of information, than
is $z \to CL_{\pi'} \cap \pi'(z)$, which can map different points in $\pi(z)$ differently. So
Thm. 10 means we can err on the side of being over-severe in our choice of clamping
operator and the associated WLU is still factored. 52

49 Note that $CL_\pi$ automatically has this property, independent of $D_1$, if its
intersection with each element of $\pi$ consists of a single worldpoint.

50 Note that both $CL_\pi$ and $CL_\pi(z)$ are implicitly parameterized by $D_1$.

51 Note that if there is some $x'$ such that $CL_\pi(x, r) = (x', r)$ $\forall x, r$, then WLU is a
special type of AU, with a delta function $f$.

52 Sometimes $WLU_{G,\pi'}(z)$ will be factored with respect to $G$ for coordinate $C \cap \pi$
even though $\pi' \subset \pi$. For example, this is the case if $G$ is independent of precisely
which of the elements of $\pi'$ contains $z$, so long as all of those elements are in $\pi(z)$.
However in general
There are other advantages to WLU that hold even when $\pi = \pi'$. For example,
in general $CL_\pi(z)$ need not lie on the set $C$ (n.b., $\pi$ and ${\sim}\pi$ are partitions
of $\zeta$, not $C$). In such a case the function $G(CL_\pi(z)) : C \to \mathbb{R}$ is not specified
by the function $G(z) : C \to \mathbb{R}$. In this situation we are free to choose the values
$G(CL_\pi(z))$ to best suit our purposes, e.g., to maximize learnability.

An associated advantage is that to evaluate the WLU for coordinate $C \cap \pi$,
we do not need to know the detailed structure of $C$. This is what using WLU
for the coarser partition $\pi$ rather than the AU for the original coordinate
$C \cap \pi'$ gains us. Given a choice of clamping element, so long as we know $G(z)$ and
$\pi(z)$, together with the functional form of $G$ for the appropriate subset of $\zeta$, we
know the value of $WLU_{G,\pi}(z)$. These advantages are borne out by the experiments
reported in [17].


(v) WLU in repeated games 

As an example of WLU, say we have a deterministic and temporally invertible
repeated game (see App. D). Let $\{\omega_1, \omega_2, \ldots, \omega_J\}$ and $\{q_1, q_2, \ldots, q_L\}$ be
two sets of generalized coordinates of $C^T$ (not necessarily repeating coordinates).
Consider a particular player/agent, and presume that $\forall t'$ there is a single-valued
mapping from $r^{t'} \to (\omega_1, \ldots, \omega_J)$, and one from
$(x^{t'}, r^{t'}) \to (q_1, q_2, \ldots, q_L)$ (both implicitly set by $C$). So the player’s
context at time $t'$ fixes the values of the $\omega_i$ (defined for time $T$), and by adding
in the player’s move at that time we also fix the values of the $q_i$. Say we also have
a utility $U$ that is a single-valued function of
$(\omega_1, \omega_2, \ldots, \omega_J, q_1, q_2, \ldots, q_L)$.

Take $\pi$ to be the partition of $\zeta$ whose elements are specified by the joint
values of the $\{\omega_1, \omega_2, \ldots, \omega_J\}$. Take $CL_\pi$ to be a set of $z$ sharing some fixed
values of $\{q_1, q_2, \ldots, q_L\}$. Note that $U$ is constant across the intersection of
$CL_\pi$ with any single element of $\pi$, as required for it to define a WLU.

Intuitively, $CL_\pi(z)$ is formed by “clamping” the values of the $\{q_1, q_2, \ldots, q_L\}$
to their fixed value while leaving the $\{\omega_1, \omega_2, \ldots, \omega_J\}$ values unchanged.
Moreover, since $r^{t'} \to (\omega_1, \ldots, \omega_J)$ is single-valued, we know that any
dependency of the important aspects of $z$ (as far as $U$ is concerned) on our player’s
move at time $t'$ is given by (a subset of) the values $\{q_1, q_2, \ldots, q_L\}$. (Recall
that all values $x^{t'}$ are allowed to accompany a particular $r^{t'}$.)

Now by Thm. 10, we know that $WLU_{U,\pi}$ is factored with respect to $U$ for
coordinate $C \cap \pi'$ for any partition $\pi'$ that is a refined version of $\pi$. In addition,
$\rho^{t'} \subseteq \pi$. So $WLU_{U,\pi}$ is factored with respect to $U$ for the coordinate given by
$C \cap \rho^{t'} = \rho^{t'}$, i.e., it is factored for our player’s context coordinate at
time $t'$.

When the $\{q_i\}$ are minimal in that none of them is a single-valued mapping
of $r^{t'}$ (i.e., none can be transferred into the set of $\{\omega_i\}$), we say they are our
such factoredness will not hold. Even if it doesn’t though, say $G$ is relatively insensitive
to which of the elements of $\pi'$ contains $z$, over the set of all such elements that are in
$\pi(z)$. Then $WLU_{G,\pi'}(z)$ will be quite close to factored for coordinate $C \cap \pi$. This
often allows us to be “sloppy” in using WLU’s, by taking $\pi'$ to be only those degrees of
freedom $C \cap \pi$ with “significant impact” on the value of $G$.
player’s effect set [17]. 53 Often a player’s behavior can be modified to ensure
that a particular set of $\{q_i\}$ contains its effect set for some particular time.
When we can do this it will assure that some associated variables $\{\omega_i\}$ specify
(a partition $\pi$ that gives) a $WLU_{G,\pi}$ for our player’s move at that time that is
factored with respect to $G$.

(vi) WLU in large systems 

Consider the case of very large systems, in which $G$ typically depends
significantly on many more degrees of freedom than can be varied within any single
element of $\rho$ (i.e., depends more on the value of $r$ than on where the system
is within that $r$). So we can write $G(x, r) = G_1(x, r) + G_2(r)$ where the values
of $G_2$ in $C$ are far greater than those of $G_1$, and correspondingly the changes
in the value of $G_1$ as one moves across $C$ are far smaller than those of $G_2$. In
such cases, with $\rho = C \cap \pi$ as usual, the learnability of $G$ is far less than that
of $WLU_{G,\pi}$. This is due to the following slightly more general theorem:

Theorem 11 Let $\kappa$ and $\pi \subseteq \kappa$ be two partitions of $\zeta$. Write
$H(z) = H_1(z) + H_2(\kappa(z))$, where $H$ is defined over all $\zeta$, and consider the agent
$\rho = C \cap \pi$. Fix $l \in \Lambda \subset \nu$, and define

$$M \equiv \{\max_{z, z'} [H_1(z) - H_1(z')]\}^2$$

and

$$L \equiv \int dk'\, dk''\, P(k'; l)\, P(k''; l)\, [H_2(k') - H_2(k'')]^2.$$

rm - 7 ... ± _£ r 1 . _ j 2 

men inucptuuciu, uj j ; \^±j~ K , z and z , 

A^WLU^jZ^ 1 ^ 2 ) L [T 
A f (H; l.x 1 ,! 2 ) - 2M V M' 

Note that as $\kappa$ becomes progressively coarser and coarser, $L$ shrinks. So such
coarsening of the clamping element will typically lead to worse learnability. In
fact, in the limit of $\kappa = {\sim}\emptyset$, $WLU_{H,\kappa}$ just equals $H$ minus a constant. So in that
53 Sometimes the $(q_1, q_2, \ldots, q_L)$ value specifying the clamping element of an effect set
can intuitively be viewed as a “null action”, so that clamping can be viewed as “removing
agent $\rho$ from the system”. Intuitively, in this case we can view WLU as a first order
subtraction from $G$ of the effects on it of specifying those degrees of freedom not contained
in the effect set (hence the name “wonderful life” utility; cf. the Frank Capra movie). More
formally, in such circumstances WLU can be viewed as an extension of the Groves mechanism of
traditional mechanism design, generalized to concern arbitrary (potentially time-extended)
world utility functions, and to concern situations having nothing to do with valuation
functions, (quasi-linear) preferences, types, revelations, or the like. (See [7, 2, 14, 2,
10, 16, 8, 27, 13].) Due to its concern for signal-to-noise issues though, this extension
relies crucially on re-scaling of $G$. (Indeed, if one just subtracts the clamped term without
any such re-scaling, ambiguity can be badly distorted, so that performance can degrade
substantially [23].) In addition, this extension allows alternative choices of the clamping
operator, even clamping to illegal (i.e., not $\in C$) worldpoints. This extension also can be
used even in cases where there is no action that can be viewed as a “null action”,
equivalent to “removing the agent from the system”.
limit, $WLU_{H,\kappa}$ and $H$ must have the exact same learnability, in agreement
with Thm. 11 and the fact that $L = 0$ in that limit.

When $L$ greatly exceeds $M$ the bound in Thm. 11 is much greater than 1.
So if we take $H = G$ and $\kappa = \pi$, Thm. 11 tells us that for very large systems,
setting the private utility to $G$’s WLU rather than to $G$ may result in an extreme
growth in learnability. 54 In particular, for $\Lambda = \nu \cap \sigma$, in large systems it may be
that $L \gg M$ $\forall l$ such that $P(l \mid s)$ is non-infinitesimal. Under the first three
premises, assuming $WLU_{G,\pi}$ and $G$ obey the conditions in Coroll. 4(i),(ii), this
means that setting the private utility to WLU will result in larger expected
intelligence of the agent than will setting it to $G$. Moreover, since that WLU is
factored with respect to $G$, this improvement in term 3 of the central equation
will not be accompanied by a degradation in term 2. This ability to scale well
to large systems is one of the major advantages of WLU and AU.
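
The scaling argument can be made concrete with a small exact computation. The
sketch below is an illustration under stated assumptions, not the chapter's own
calculation: learnability is taken to be the squared difference of the conditional
expected utilities at two moves divided by the $f$-averaged conditional variance
(an assumed signal-to-noise form), contexts are uniformly distributed binary
vectors, $f$ is uniform over the two moves, and the world utility `G` is
hypothetical.

```python
from itertools import product


def learnability(U, n_others, x1=0, x2=1):
    """Assumed form: (E[U|x1] - E[U|x2])**2 / average over x of Var[U|x]."""
    contexts = list(product([0, 1], repeat=n_others))  # all r, uniform P(r)

    def mean_var(x):
        vals = [U(x, r) for r in contexts]
        m = sum(vals) / len(vals)
        v = sum((u - m) ** 2 for u in vals) / len(vals)
        return m, v

    (m1, v1), (m2, v2) = mean_var(x1), mean_var(x2)
    noise = 0.5 * (v1 + v2)        # f uniform over the two moves
    return (m1 - m2) ** 2 / noise


def G(x, r):
    # Small move-dependent term G_1 = x * r[0], large context-only term G_2 = sum(r).
    return x * r[0] + sum(r)


def wlu(x, r):
    return G(x, r) - G(0, r)       # clamp the agent's move to 0


learn_G = learnability(G, 10)
learn_wlu = learnability(wlu, 10)
```

In this toy system the learnability of `G` falls roughly as one over the number of
other agents, while that of the WLU does not depend on that number at all.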

(vii) WLU in spin glasses 

As a final example, consider a spin glass with spins $\{b_i\}$. For each spin $i$ let
$b_{-i}$ be the set of spins other than $i$, and for each $i$ let $h_i$ and $F_i$ be any two
functions such that the Hamiltonian can be written as
$\mathcal{H}(b) = h_i(b_i, b_{-i}) + F_i(b_{-i})$. In particular, for
$\mathcal{H}(b) = \sum_{jk} K_{jk} b_j b_k + \sum_j H_j b_j$, we can have
$F_i(b_{-i}) = \sum_{j \ne i, k \ne i} K_{jk} b_j b_k + \sum_{j \ne i} H_j b_j$, and
$h_i = H_i b_i + K_{ii} b_i^2 + \sum_{j \ne i} [K_{ij} + K_{ji}] b_j b_i$. Since at equilibrium
$b$ minimizes $\mathcal{H}$, given the equilibrium value of $b_{-i}$, at the
$\mathcal{H}$-minimizing point $b_i$ is set to the value that minimizes $h_i(b_i, b_{-i})$.

We can view this as an instance of a collective where $\mathcal{H}$ is the (negative)
world utility $G$ for a system of “agents” $\rho$ with move $b_\rho$, and $g_\rho = h_\rho$. For all
$\rho$, at the $b$ that maximizes $G$, $b_\rho$ is set to the value that maximizes $-h_\rho$ given
$b_{-\rho}$. More generally, $h_i(b_i, b_{-i}) = \mathcal{H}(b) - F_i(b_{-i})$ is factored with
respect to $G(b)$ (cf. Thm. 2), with the context for each agent $\rho$ being $b_{-\rho}$ and
$C = \zeta$ being the set of all vectors $b$. So any $b$ (locally) maximizing $G$ also
simultaneously maximizes all of the $-h_i$. Frustration then is a state where all the
agents’ intelligences equal 1, but the system is at a local rather than global maximum
of $G$.

Consider a particular spin/agent, $\rho$. Embed $C$, the set of all possible $b$,
in some larger space that allows the spin $\rho$ to take on additional values, and
redefine $\zeta$ to be that larger space. Let $\pi$ be an associated $\zeta$-partition such that
$\rho = C \cap \pi$. Take $CL_\pi$ to be some set off of $C$. Extend the domain of definition
of $h_\rho$ by setting $h_\rho(CL_\pi(b)) = 0$ $\forall b \in C$. Then $WLU_{G,\pi} = -h_\rho$, i.e., WLU is
the “local Hamiltonian” perceived by spin $\rho$, whereas $G$ is the Hamiltonian of
the entire system.
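
The identification of WLU with the local Hamiltonian can be checked numerically.
The sketch below assumes a quadratic Hamiltonian with arbitrary couplings `K` and
fields `H`, spins taking values $\pm 1$, and clamping spin $i$ to the value 0, which is
not a legal spin value and so plays the role of a point off of $C$. All names and the
random parameters are hypothetical.

```python
import itertools
import random


def hamiltonian(b, K, H):
    n = len(b)
    pair = sum(K[j][k] * b[j] * b[k] for j in range(n) for k in range(n))
    field = sum(H[j] * b[j] for j in range(n))
    return pair + field


def local_hamiltonian(i, b, K, H):
    """All terms of the Hamiltonian that involve spin i."""
    n = len(b)
    cross = sum((K[i][j] + K[j][i]) * b[j] for j in range(n) if j != i)
    return H[i] * b[i] + K[i][i] * b[i] ** 2 + cross * b[i]


def wlu(i, b, K, H):
    """WLU of G = -Hamiltonian, with spin i clamped to 0."""
    clamped = list(b)
    clamped[i] = 0
    return -hamiltonian(b, K, H) + hamiltonian(clamped, K, H)


rng = random.Random(0)
n = 4
K = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
H = [rng.uniform(-1, 1) for _ in range(n)]

# Largest deviation between WLU and minus the local Hamiltonian over all states.
max_gap = max(
    abs(wlu(i, b, K, H) + local_hamiltonian(i, b, K, H))
    for b in itertools.product([-1, 1], repeat=n)
    for i in range(n)
)
```

Clamping to 0 removes every term that involves spin $i$, so the subtraction leaves
exactly those terms with their sign flipped.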

So by Thm. 11, if the number of nonzero coupling strengths between p and 
the other spins is much smaller than the total number of nonzero coupling 
strengths in the system, then the learnability of p’s local Hamiltonian far exceeds 

54 Trivially, since learnability of AU is bounded below by that of WLU, its learnability must 
exceed that of a team game at least as much as WLU’s does. 
that of the global Hamiltonian. Accordingly, consider casting the evolution of
the spin system as an iterated game, with each spin controlled by a learning
algorithm, and each $g_{\rho,s^t}$ set to either spin $\rho$’s local Hamiltonian at time $t$,
or to the global Hamiltonian at that time. (See App. D.) Then since WLU is
factored with respect to $G$, we would expect (under the first three premises,
and assuming the conditions of Coroll. 4(i),(ii) hold, etc.) that at any particular
timestep of the game $b$ is closer to a local peak of the global Hamiltonian if the
agents use the value at that timestep of their local Hamiltonians as their private
utilities, rather than use the value of the global Hamiltonian at that timestep.

If we also incorporate techniques addressing term 1 in the central equation, 
then we can ensure that such local peaks are large compared to the global peak. 
Moreover, if we have the spins use a WLU with better learnability, we would 
expect faster convergence still. Similarly, if the spins use AU rather than their 
local Hamiltonians, then since this increases learnability, performance of the 
overall system should improve further still. (Roughly speaking, such a change in 
private utilities is equivalent to having the agents use mean-field approximations 
of their local Hamiltonians as their rewards rather than the actual values of 
their local Hamiltonians.) More generally, any modification of the system that 
induces higher learnability (while maintaining factoredness of the individual 
spins’ private utilities with respect to the original Hamiltonian) should result in 
faster convergence to the minimum of the original Hamiltonian. The foregoing
is borne out in experiments reported in [24].
A Intelligence, Percentiles and Generalized CDF’s

A useful example of intelligence is the following: 

$$N_{\rho,U}(z) \equiv \int d\mu_{\rho(z)}(z')\, \Theta[U(z) - U(z')] \qquad (A.1)$$

with the subscript on the (usually normalized) measure indicating it is restricted
to $z' \in \rho(z)$ (usually it is also nowhere-zero in that region). For consistency with
its use in expansions of CDF’s, the Heaviside function is here taken to equal 0/1
depending on whether its argument is less than 0 or not. (Having $\Theta(0) = 0$ in
Eq. A.1 is also a valid intelligence operator.) Intuitively, this kind of intelligence
quantifies the performance of z in terms of its percentile rank, exactly as is 
conventionally done in tests of human cognitive performance. Note that this 
type of intelligence is a model-free quantification of performance quality; even if 
z is set by an agent that wants large N Pi jj and N p p(z) turns out to be large “by 
luck”, we still give that agent credit. The analogous coordinateless expression is 
given by $N_U(z) \equiv \int d\mu(z')\, \Theta[U(z) - U(z')]$ where $\mu$ runs over all of $C$.
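
For a finite set of candidate worldpoints, the percentile-type intelligence of
Eq. A.1 reduces to a weighted count. The following is a minimal sketch under that
assumption; the function name, the uniform default measure, and the example
utility are hypothetical.

```python
def percentile_intelligence(U, z, candidates, weights=None):
    """Discrete analogue of Eq. A.1 with the convention Theta(0) = 1.

    `candidates` lists the worldpoints z' sharing z's context, and `weights`
    is the (normalized) measure over them; it defaults to uniform.
    """
    if weights is None:
        weights = [1.0 / len(candidates)] * len(candidates)
    return sum(w for zc, w in zip(candidates, weights) if U(z) - U(zc) >= 0)


def U(z):                      # hypothetical utility peaked at z = 2
    return -(z - 2) ** 2


candidates = [0, 1, 2, 3, 4]
best = percentile_intelligence(U, 2, candidates)
worst = percentile_intelligence(U, 0, candidates)
```

The best move scores 1, while the worst move still scores the measure of the set of
points that tie with it, which reflects the convention chosen for the Heaviside
function at zero.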
There is a close relationship between CDF’s and intelligence in general, not
just percentile-based intelligence. Thm. 3 provides an example of that
relationship. For percentile-based intelligence though the relationship is even
deeper. In particular, coordinateless percentile-based intelligence can be viewed as
a generalization of cumulative distribution functions (CDF’s). This generalization
applies to arbitrary spaces serving as the argument of the underlying probability
density function (not just $\mathbb{R}^1$) and does not arbitrarily restrict the “sweep
direction” (said direction being from $-\infty$ to $+\infty$ for the conventional case). In
particular, for the special case of $z \in \mathbb{R}^1$ and invertible $U(\cdot)$ where
$|\nabla_z U(z)| = 1$ a.e., $|\nabla_z N_U(z)|$ gives the probability density $\mu(z)$ and
$0 \le N_U(z) \le 1$ $\forall z$, just like with the conventional CDF for which the underlying
space is $\mathbb{R}^1$. (In fact, for $U(z \in \mathbb{R}^1) = z + \text{constant}$, $N_U(z)$ is identical to
the conventional CDF of the underlying distribution $\mu(z)$.) For the more general
case, intuitively, $U$ itself provides the flow lines of the sweep.

Percentile-type intelligence is arbitrary up to the choice of measure $\mu$, and
in a certain sense essentially any intelligence (in the sense defined in the text)
can be “expressed” as a percentile-type intelligence. As an alternative to these
kinds of intelligences, one might consider standardizing a utility $U$ by simply
subtracting some canonical value (like the expected value of $U$) from $U(z)$.
This operation doesn’t take into account the width of the distribution over $U$
values however, and therefore doesn’t tell us how significant a particular value
$U(z) - E(U)$ is. To circumvent this difficulty one might “recalibrate” $U(z) - E(U)$
by dividing it by the variance of the distribution, but this can be misleading
for skewed distributions; higher-order moments may be important. Formally,
even such a recalibrated function runs afoul of condition (i) in the definition of
intelligence.

One important property of percentile-type intelligence is that with uncountable
$\zeta$ and a utility $U$ having no plateaus in $\zeta$, if $P({\sim}r \mid r, s) = \mu_r({\sim}r)$ and is
independent of $r$, then $P(N_U(z) \mid s)$ is constant, regardless of $U$ and $\mu$. More
formally,

Theorem A.1 Assume that for all $y$ in some subinterval of [0.0, 1.0], for all
$r$ in supp $P(\cdot \mid s)$ there exists ${\sim}r$ such that the intelligence
$N_{\rho,U}(r, {\sim}r) = y$. Restrict attention to cases where the intelligence measure
$\mu_r({\sim}r) = P({\sim}r \mid r, s)$ and is independent of $r$. For all such cases,
$P(N_U(z) \mid s)$ is flat with value 1.0, independent of both $\mu$ and $U$.

Proof: We use the complement notation discussed in App. B. Write

$$P(N_{\rho,U}(r, {\sim}r) = y \mid s) = \int dr\, d{\sim}r\, P(r \mid s)\, P({\sim}r \mid r, s)\, P(N_{\rho,U}(r, {\sim}r) = y \mid r, s).$$

Next write $P(N_{\rho,U}(r, {\sim}r) = y \mid r, s)$ as the derivative of the CDF
$P(N_{\rho,U}(r, {\sim}r) \le y \mid r, s)$ with respect to $y$. Now by assumption there exists a
${\sim}r$ such that $N_{\rho,U}(r, {\sim}r) = y$. So we can rewrite that CDF as

$$P(N_{\rho,U}(r, {\sim}r') \le N_{\rho,U}(r, {\sim}r) \mid r, s),$$

where the probability is over ${\sim}r'$, according to the distribution $P({\sim}r' \mid r, s)$.

We can rewrite this CDF as

$$P(U(r, {\sim}r') \le U(r, {\sim}r) \mid r, s),$$

by property (ii) of the general definition of intelligence. In turn we can write
this as

$$\int d{\sim}r'\, P({\sim}r' \mid r, s)\, \Theta(U(r, {\sim}r) - U(r, {\sim}r'))$$
$$= \int d{\sim}r'\, \mu({\sim}r')\, \Theta(U(r, {\sim}r) - U(r, {\sim}r')) \quad \text{(by assumption)}$$
$$= N_{\rho,U}(r, {\sim}r) \quad \text{(by definition of intelligence)}$$
$$= y.$$

Therefore the derivative of our CDF = 1. QED.

Intuitively, this theorem says that the probability that a randomly sampled
point has a value of $U <$ the $y$'th percentile of $U$ is just $y$, so its derivative = 1,
independent of the underlying distributions. Note that both the assumption that
$P({\sim}r \mid r, s)$ is independent of $r$ and having $\mu({\sim}r) = P({\sim}r \mid s)$ are “natural” in
single-stage games, but not necessarily in multi-stage games (see App. D).
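
The flatness asserted by the theorem can be checked empirically. The sketch below
is a Monte Carlo illustration under simplifying assumptions: it uses the
coordinateless case, takes the intelligence measure equal to the sampling
distribution (approximated by a reference sample), and uses a hypothetical strictly
increasing utility with no plateaus. Sample sizes and the seed are arbitrary.

```python
import bisect
import random


def empirical_intelligences(U, sampler, n_ref=4000, n_test=4000, seed=0):
    """Percentile intelligence of fresh samples, measured against a reference
    sample drawn from the same distribution."""
    rng = random.Random(seed)
    ref = sorted(U(sampler(rng)) for _ in range(n_ref))
    out = []
    for _ in range(n_test):
        u = U(sampler(rng))
        # Fraction of reference points with utility <= u (Theta(0) = 1).
        out.append(bisect.bisect_right(ref, u) / n_ref)
    return out


vals = empirical_intelligences(lambda z: z ** 3 + z, lambda rng: rng.gauss(0, 1))
frac_low = sum(v < 0.25 for v in vals) / len(vals)
mean_val = sum(vals) / len(vals)
```

Replacing the utility by any other plateau-free function of the sampled point, or
the Gaussian by another continuous distribution, should leave the histogram of
`vals` approximately flat on [0, 1].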

If the conditions in the theorem apply, then choice of $U$ is irrelevant to term
3 in the central equation. If we choose a “reasonable” $U$ this means that we
cannot have $P({\sim}r \mid s) = \mu({\sim}r)$ if we want to have choice of coordinate utility
make a difference.

Note though that the assumption about the subinterval of [0.0, 1.0] will be
violated if $U$ has isoclines of non-zero probability. This will occur if $\mu$ has delta
functions, or if $\zeta$ is a Euclidean space and $U$ has plateaus extending over the
support of $P(z \mid s)$. A particular example of the former is when $\zeta$ is a countable
space; the theorem does not apply to categorical spaces.

B Theory of Generalized Coordinates 

It can be useful to view coordinates as “subscripts” on “vectors” $z$. Similarly, in
light of their role as partitions of $C$, it can be useful to view separate coordinates
as separate sets, complete with analogues of the conventional operations of set
theory. As explicated in this appendix, these two perspectives are intimately
related.

Now define $z_\rho \equiv {\sim}\rho(z)$, so $z_{\sim\rho} = \rho(z)$. Typically we identify the elements
of $z_\rho$ not by the sets making up ${\sim}\rho(z)$, but rather by the labels of those sets.
This notation is convenient when $\zeta$ is a multi-dimensional vector space, since it
makes the natural identification of contexts with vector components consistent
with the conventional subscripting of vectors. For example, say $\zeta = \mathbb{R}^3$, with
elements written $(x, y, z)$. Then a context for an “agent” making “move” $x$, $\rho_x$,
is most naturally taken to be the partition of $\mathbb{R}^3$ that is indexed by the moves
of the other players, i.e., the values of $y$ and $z$. In other words, specifying $y$ and
$z$ gives a line delineating the remaining degrees of freedom of setting $z \in \mathbb{R}^3$
that are available to agent $x$ in determining its move, and each such line is an
element of the partition $\rho_x$. For this $\rho_x$, we can take the complement ${\sim}\rho_x$ to be
the partition of $\mathbb{R}^3$ whose elements are planes of constant $x$, i.e., whose elements
are labeled by the value of $x$. We can then write ${\sim}\rho_x(z) \equiv z_{\rho_x} \equiv z_x$. With this
choice $z_x$ is just $z$’s $x$ value (recall we identify an element of $z_x$ by its label).
This is in accord with the usual notation for vector subscripts.

To formulate a set theory over coordinates, first note that coordinates are not
just sets, but special kinds of sets: a coordinate’s elements are non-intersecting
subsets of $C$ whose union equals $C$. So for example to have $\rho_1 \cup \rho_2$ be a
coordinate, it cannot be given by the set of all elements of $\rho_1$ and $\rho_2$, as it would
under the conventional set theoretic definition of the union operator. (If the
union operator were defined in that conventional manner, its elements would
have non-zero intersection with one another.) This means that we cannot simply
view coordinates as conventional sets and define the set theory operators
over coordinates accordingly; we need new definitions.

To flesh out a full “set-theory” of coordinates, first note that the complement
operation has already been defined. (Note that unlike in conventional set theory,
here the complement operator is not single-valued.) We can also define the null
set coordinate $\emptyset$ as the coordinate each of whose members is a single $z \in C$. So
$\emptyset$ is bijectively related to $C$, and ${\sim}\emptyset$ can be taken to be the coordinate
consisting of a single set: all of $C$.

To define the analogue of set inclusion, given two coordinates $\rho_1$ and $\rho_2$, we
take $\rho_1 \subseteq \rho_2$ iff each element of $\rho_1$ is a subset of an element of $\rho_2$. Intuitively,
$\rho_1$ is a finer-grained version of $\rho_2$ if $\rho_1 \subseteq \rho_2$, with $\rho_1(z)$ always providing at
least as much information about $z$ as does $\rho_2(z)$. So $\rho_1$ is a delineation of a
set of degrees of freedom that includes those delineated by $\rho_2$. Note that $\forall \rho$,
$\emptyset \subseteq \rho \subseteq {\sim}\emptyset$, just as in conventional set theory.

One special case of having $\rho_1 \subset \rho_2$ is where every element of $\rho_1$ occurs in $\rho_2$,
as in the traditional notion of set inclusion. (For our purposes we can broaden
that special case, which is what we've done in our definition.) Note also that
the $\subset$ relation is transitive and that both $\rho_1 \subset \rho_2$ and $\rho_2 \subset \rho_1$ iff $\rho_1 = \rho_2$,
and that $\rho_1 \subset \rho_2$ means there are $\hat{\rho}_1$ and $\hat{\rho}_2$ such that $\hat{\rho}_2 \subset \hat{\rho}_1$, just as in
conventional set theory.

The other set-theory-like operations over coordinates can be defined by
generalizing from the special case of conventional vector subscripts. For example,
$\rho_1 \cap \rho_2$ is shorthand for a coordinate whose members are given by the
intersections of the members of $\rho_1$ and $\rho_2$. We make this definition to accord
with the conventional vector subscript interpretation of $z_{\rho_1 \cup \rho_2}$ as having
its elements be the surfaces in $\zeta$ of both constant $z_{\rho_1}$ and constant $z_{\rho_2}$.
(E.g., when $\zeta = \mathbb{R}^3$ and has elements written as (x, y, z), “$z_{x \cup y}$” means $z_{x,y}$,
which is the set of points of constant $z_x$ and $z_y$.) Given this interpretation,
write $z_{\rho_1 \cup \rho_2} = \widehat{(\rho_1 \cup \rho_2)} \equiv \hat{\rho}_1 \cap \hat{\rho}_2$. This then means that the elements of
$\rho_1 \cap \rho_2 = z_{\hat{\rho}_1 \cup \hat{\rho}_2}$ should be surfaces of constant $z_{\hat{\rho}_1} = \rho_1(z)$ and constant
$z_{\hat{\rho}_2} = \rho_2(z)$, exactly as our definition of the intersection operator stipulates.
Note that $\rho_1 \cap \rho_2 \subset \rho_1$, as one would like. Intuitively, the intersection operator
is just the comma operator given by Cartesian products. (E.g., when $\zeta = \mathbb{R}^2$
and has elements written as (x, y), $z_x \cap z_y$ is indexed by the vector $(z_x, z_y)$.)
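
As a rough illustration of the inclusion and intersection operators just described, the sketch below treats a coordinate as a partition of a small finite space. The function names (`intersect`, `is_finer`, `is_partition`) and the 2 x 3 toy grid are assumptions made for this example; they are not taken from the paper.

```python
from itertools import product


def intersect(p1, p2):
    """Coordinate intersection: all non-empty pairwise intersections of members."""
    return {a & b for a, b in product(p1, p2) if a & b}


def is_finer(p1, p2):
    """Inclusion analogue: every member of p1 is a subset of some member of p2."""
    return all(any(a <= b for b in p2) for a in p1)


def is_partition(p, space):
    """Members are non-empty, pairwise disjoint, and their union is the space."""
    members = list(p)
    covers = set().union(*members) == set(space)
    disjoint = sum(len(m) for m in members) == len(space)
    return all(members) and covers and disjoint


# Toy space: a 2 x 3 grid of points (x, y).
space = frozenset(product(range(2), range(3)))
# rho_x: each member is a line of constant y (x varies within a member).
rho_x = {frozenset(pt for pt in space if pt[1] == y) for y in range(3)}
# rho_y: each member is a line of constant x (y varies within a member).
rho_y = {frozenset(pt for pt in space if pt[0] == x) for x in range(2)}
# Null set coordinate: every member is a single point.
null = {frozenset([pt]) for pt in space}

meet = intersect(rho_x, rho_y)
```

In this toy case `meet` coincides with the null set coordinate, and it is finer than both `rho_x` and `rho_y`, matching the remark that the intersection is contained in each of its arguments.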

Finally, the intersection operator defines the union operator as $\rho_1 \cup \rho_2 =
\widehat{(\hat{\rho}_1 \cap \hat{\rho}_2)} = \widehat{(z_{\rho_1} \cap z_{\rho_2})}$. To illustrate this, in the example of $\mathbb{R}^3$, where the
elements of $\rho_x$ are lines of constant (y, z), and the elements of $\rho_y$ are lines of
constant (x, z), the elements of $\rho_x \cup \rho_y$ are planes of constant z. Similarly, when
$\rho_1 \subseteq \rho_2$, $\rho_2 \setminus \rho_1$ is shorthand for a particular coordinate $\rho \subseteq \rho_2$ that is disjoint
from $\rho_1$ (i.e., such that $\rho_1 \cap \rho = \emptyset$) and such that $\rho_1 \cup \rho = \rho_2$. Both operations
are not single-valued, in general.

Note that in analogy to set theory, any coordinate $\rho_1$ such that there is no
$\rho_2 \subset \rho_1$ is equal to the null set coordinate. The analogue of a “single-element
set” is a coordinate $\rho$ that contains only itself and the null set. This is any
coordinate all of whose members but one consist of a single $z \in C$, where that
other member consists of two such z.


C Miscellaneous Proofs 

Proof of Thm. 1: Choose any $z', z'' \in \rho(z)$. $\mathrm{sgn}[N_{\rho,U_1}(z') - N_{\rho,U_1}(z'')] =
\mathrm{sgn}[U_1(z') - U_1(z'')]$ for all such $z'$ and $z''$, by definition of intelligence. Similarly,
$\mathrm{sgn}[N_{\rho,U_2}(z') - N_{\rho,U_2}(z'')] = \mathrm{sgn}[U_2(z') - U_2(z'')]$ for all such points. But by
hypothesis, $N_{\rho,U_2}(z'') = N_{\rho,U_1}(z'')$ and $N_{\rho,U_2}(z') = N_{\rho,U_1}(z')$. So
$\mathrm{sgn}[N_{\rho,U_1}(z') - N_{\rho,U_1}(z'')] = \mathrm{sgn}[N_{\rho,U_2}(z') - N_{\rho,U_2}(z'')]$. Transitivity then
establishes the forward direction of the theorem.

To establish the reverse direction, simply note that $\mathrm{sgn}[U_1(z') - U_1(z'')] =
\mathrm{sgn}[U_2(z') - U_2(z'')]$ $\forall z' \in \rho(z)$, by hypothesis, and therefore by the first part of
the definition of intelligence, $U_1$ and $U_2$ have the same intelligence at $z''$. Since
this is true for all $z'' \in \rho(z)$, $U_1$ and $U_2$ have the same intelligence throughout
$\rho(z)$. QED.

Proof of Thm. 2: Consider any $z', z'' \in \rho(z)$. We can always write
$\mathrm{sgn}[U_2(z'') - U_2(z')] = \mathrm{sgn}[\Phi(U_2(z''), \rho(z)) - \Phi(U_2(z'), \rho(z))]$, due to the restriction on $\Phi$.
Therefore $U_1$ and $U_2$ have the same intelligence at $z'$, by the first part of the
definition of intelligence. Since this is true $\forall z' \in \rho(z)$, $U_1$ and $U_2$ are factored
at z. This establishes the backwards direction of the proof.

For the forward direction, use Thm. 1 and the fact that the system is factored
to establish that $\forall z$ in C, $\forall z'', z' \in \rho(z)$, $U_1(z') = U_1(z'')$ iff $U_2(z') = U_2(z'')$.
Therefore for all points in $\rho(z)$, the value of $U_1$ can be written as a single-valued
function of the value of $U_2$. Since Thm. 1 also establishes that $U_1(z') > U_1(z'')$
iff $U_2(z') > U_2(z'')$, we know that that single-valued function must be strictly
increasing. Identifying that function with $\Phi$ completes the proof. QED.
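
The backwards direction of Thm. 2 can be spot-checked numerically. The sketch below is only an illustration under stated assumptions: moves and contexts are small integer ranges, the utilities `u2`, `u1`, `bad` and the map `phi` are invented for the example, and "same intelligence" is tested through the sign condition used in the proof rather than through any particular intelligence formula.

```python
import math


def sgn(v):
    """Sign function returning -1, 0 or 1."""
    return (v > 0) - (v < 0)


def same_ordering(u1, u2, moves, contexts):
    """True if, within every context r, u1 and u2 rank every pair of moves identically."""
    return all(
        sgn(u1(x1, r) - u1(x2, r)) == sgn(u2(x1, r) - u2(x2, r))
        for r in contexts
        for x1 in moves
        for x2 in moves
    )


moves = range(5)
contexts = range(4)


def u2(x, r):
    """Hypothetical base utility of move x in context r."""
    return (x - 2) ** 2 - r * x


def phi(u, r):
    """Strictly increasing in u for every fixed context r."""
    return math.exp(0.1 * u) + 7 * r


def u1(x, r):
    """Utility of the form Phi(U2, context), as in the theorem."""
    return phi(u2(x, r), r)


def bad(x, r):
    """A utility that reverses the ordering, so it should not be factored."""
    return -u2(x, r)


factored = same_ordering(u1, u2, moves, contexts)
not_factored = same_ordering(bad, u2, moves, contexts)
```

The context-dependent offset `7 * r` in `phi` does not affect the check, since comparisons are only made between moves that share a context.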

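Proof of Thm. 3: $\mathrm{CDF}(V(\omega, \kappa) \mid l^a, \kappa) < \mathrm{CDF}(V(\omega, \kappa) \mid l^b, \kappa)$ means that for
any fixed $z'$, with $y \equiv V(z')$,

$$P(\omega : V(\omega, \kappa) < y \mid l^a, \kappa) < P(\omega : V(\omega, \kappa) < y \mid l^b, \kappa).$$

This is equivalent to

$$P(z : V(\omega(z), \kappa(z)) < y \mid l^a, \kappa) < P(z : V(\omega(z), \kappa(z)) < y \mid l^b, \kappa),$$

i.e.,

$$P(z : V(z) < y \mid l^a, \kappa) < P(z : V(z) < y \mid l^b, \kappa).$$

Since $z \in \kappa$ in both of these probabilities, by the second part of the definition
of intelligence we get

$$P(z : N_{\kappa,V(\cdot,\kappa)}(z) < N_{\kappa,V(\cdot,\kappa)}(z') \mid l^a, \kappa)
< P(z : N_{\kappa,V(\cdot,\kappa)}(z) < N_{\kappa,V(\cdot,\kappa)}(z') \mid l^b, \kappa) \quad \forall z' \in \kappa.$$

This in turn is equivalent to $\mathrm{CDF}(N_{\kappa,V(\cdot,\kappa)} \mid l^a, \kappa) < \mathrm{CDF}(N_{\kappa,V(\cdot,\kappa)} \mid l^b, \kappa)$.

Next write $E(N_{\kappa,V(\cdot,\kappa)} \mid n, \kappa) = \int_0^1 dy\, y\, P(N_{\kappa,V(\cdot,\kappa)} = y \mid n, \kappa)$. Integrate by
parts to get

$$E(N_{\kappa,V(\cdot,\kappa)} \mid l^a, \kappa) - E(N_{\kappa,V(\cdot,\kappa)} \mid l^b, \kappa) =
\int_0^1 dy\, [\mathrm{CDF}(N_{\kappa,V(\cdot,\kappa)} \mid l^b, \kappa) - \mathrm{CDF}(N_{\kappa,V(\cdot,\kappa)} \mid l^a, \kappa)].$$

Since $\forall y$, $\mathrm{CDF}(N_{\kappa,V(\cdot,\kappa)} \mid l^a, \kappa)(y) < \mathrm{CDF}(N_{\kappa,V(\cdot,\kappa)} \mid l^b, \kappa)(y)$, this last integral
cannot be negative. The analog for equalities of CDF's and expectations rather
than inequalities follows similarly. QED.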
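
The integration-by-parts step can be checked on a simple pair of distributions supported on [0, 1]. This is a numerical sketch only: the densities (`pdf_a`, `pdf_b`), their CDFs, and the midpoint-rule helper `integrate` are assumptions chosen for the example and stand in for the intelligence distributions under $l^a$ and $l^b$.

```python
def integrate(f, a=0.0, b=1.0, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def pdf_a(y):
    """Density 2y on [0, 1]; its CDF lies below the uniform CDF."""
    return 2.0 * y


def cdf_a(y):
    return y * y


def pdf_b(y):
    """Uniform density on [0, 1]."""
    return 1.0


def cdf_b(y):
    return y


# Left-hand side: difference of the two expectations.
mean_gap = integrate(lambda y: y * pdf_a(y)) - integrate(lambda y: y * pdf_b(y))
# Right-hand side: integral of CDF_b - CDF_a.
cdf_gap = integrate(lambda y: cdf_b(y) - cdf_a(y))
```

Both quantities agree up to discretization error and are non-negative, as the proof requires when the first CDF nowhere exceeds the second.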

Proof of Lemma 1: Since both $P_i$ are normalized and they are distinct (if
they aren't distinct, we're done), $\exists u^*$ such that $P_1(u^*) > P_2(u^*)$. By our
condition concerning the $P_i$, $P_1(u) > P_2(u)$ $\forall u > u^*$. Similarly there exists a u
everywhere below which $P_2$ exceeds $P_1$. Accordingly, there is a greatest lower
bound on the $u^*$'s, T. $\forall y < T$, $P_1(u < y) < P_2(u < y)$, and therefore by the
non-negativity of $\phi'$, $\forall y < \phi(T)$, $P_1(u : \phi(u) < y) < P_2(u : \phi(u) < y)$. So the
CDF of $\phi$ according to $P_1$ is less than that according to $P_2$ everywhere below
T. Therefore if there is to be any y value at which the CDF of $\phi$ according to
$P_1$ is greater than that according to $P_2$, there must be a least such y value, and
therefore a corresponding least such u, $u'$. We know that $u' > T$. However for all
$u > T$, $P_1(u) > P_2(u)$. Therefore $P_1(u : \phi(u) > \phi(u')) > P_2(u : \phi(u) > \phi(u'))$.
Summing the $P_1$ probabilities of $\phi(u)$ exceeding and being less than $\phi(u')$, and
doing the same for $P_2$, we see that both $P_i$ cannot be normalized, which is
impossible. QED.

Proof of Thm. 4: When the t/?’s both equal and A = v, by its definition H 
must be the actual associated n-conditioned distributions over x, P(x | n a ) and 
P(x | n h ). 

To complete the proof we must demonstrate that there is at least one 
parametric form for H that obeys the condition in the theorem when one of 
the ^’s does not equal jp and/or A / 1 /. We do this by construction. First 
take the derivative of each ambiguity (one for each x) to get the convolutions
$\int dy^1\, dy^2\, P_\psi(y^1; l, x^1)\, P_\psi(y^2; l, x^2)\, \delta(y - (y^1 - y^2))$. Multiply each such convolution
by y and integrate the result over all y. This gives us the differences between the
means of all the distributions $P_\psi(y; l, x)$ (one distribution for each x). Translate
all those means, $M(\psi, l, x)$, by the same amount so that the lowest one has value
1. Then take (x 1 | Z) oc 

Use the relation between ordered and unordered ambiguity to rewrite the 
condition in the theorem as tu{x 1 , x 2 )A(^> a ; Z a , x 1 , x 2 ) < tv(x l ,x 2 )A(ip b ; Z b , x 1 ,! 2 ). 

Consider some particular pair x 1 , x 2 , where without loss of generality tu(x x , x 2 ) = 

1. Integrate A(y; ip a ; Z a , x 1 , x 2 )—A(y; ip b ; Z b , x 1 , x 2 ) by parts. So long as y[A(y; ^ a ; Z a , x 1 , x 2 )— 
A(y; Z b , x 1 , x 2 )] goes to 0 as y goes to either positive or negative infinity, the 
result is 


— [(M^V'Sx 1 ) - M(#\Z G ,x 2 )) - (Mtyhjh^x 1 ) -M(^ b ,Z b ,x 2 ))]. 


By hypothesis, ^(x^x 2 ) times this expression must be negative. Therefore 

r>[^ a ;X)/ 1 jja\ lijb\ 

Now apply Lemma 1. QED.

Proof of Coroll. 2: Expand P(Z7 j r, s) — f dndxP(n j r, s)Z7(x, r)P^(x j n). 

By the second premise we can write this integral as 

J dndxP(n j r, s)U (x, r)pf* /, °l (x | n, s) = J dndxP(n j r, s)Z7(x, r)P^* ;I/ ’^(x | n, s) 

= [ dndxP(n j r, s)£7(x, (x | n, s, W) 


/< 

/< 


= / dnP(n | r, (17 | n, 5, W). 

QED. 

Proof of Coroll. 3: For both ip — g s a and ip = g s b , expand 

E^{N p \n,s b ) = J dr P{n 1 ^ E^(N p | n,r,s b ). 

Rearranging terms gives the hypothesis inequality of our corollary. Now apply 
CorolL 2 to the consequent inequality of the third premise with S2 = 77 — 0. 

QED. 


Proof of Thm. 5: By condition (iv), the quantity $y^*$ defined there must equal
$g_{s^a}(x^*, r)$. Now fix $x^1$ and $x^2$. By conditions (ii) and (iii), for both of those
moves $x^i$, $g_{s^a}(x^i, r')$ has either the value 0 or 1 for all $r'$ arising in the expansion
of $A(g_{s^a}; n, x^1, x^2)$. Combining this with the value of $y^*$, we see that for any r
and any pair $(x^1, x^2)$, one of the following four cases must hold:

I) $g_{s^a}(x^1, r) = 0$, and

$P(g_{s^a} = y; n, x^1)$ is a delta function about 0, and

$P(g_{s^a} = y; n, x^2)$ is an average of two delta functions, centered about 0
and about 1;

II) $g_{s^a}(x^1, r) = 1$, and

$P(g_{s^a} = y; n, x^1)$ is a delta function about 1, and

$P(g_{s^a} = y; n, x^2)$ is an average of two delta functions, centered about 0
and about 1.

(Cases (III) and (IV) are the same as (I) and (II), just with $x^1$ and $x^2$
interchanged.)

Without loss of generality assume that we're in case (II). Then expand
$A(y; U, g_{s^a}; n, x^1, x^2)$ as

$$\int dy^1\, dy^2\, P(g_{s^a} = y^1; n, x^1)\, P(g_{s^a} = y^2; n, x^2)\, \Theta[y - (y^1 - y^2)\, \mathrm{sgn}[U(x^1, r) - U(x^2, r)]]. \quad (C.2)$$

This evaluates as

$$\int dy^2\, P(g_{s^a} = y^2; n, x^2)\, \Theta[y - (1 - y^2)\, \mathrm{sgn}[U(x^1, r) - U(x^2, r)]].$$

Now sgn [p s a (x 1 ,r)—g s a (r 2 , r)] equals 0 or 1 for case (II). So by condition (i), and 
the factoredness of g sa and <7 s t, this must also be true for sgnjT/^x 1 , r)—U(x 2 , r)]. 

Given that $y^2$ cannot exceed 1, this in turn means that the theta function is
nonzero only for non-negative y. Accordingly, so is the ambiguity.

This character of the ambiguity holds for all four cases: for all of them the
ambiguity $A(y; g_{s^a}; n, x^1, x^2)$ is 0 up to y = 0 where it may have a jump, and
then is flat up to 1, where if the first jump did not go up to 1 it now has a second
jump that gets it up to 1. So its support is assuredly non-negative. QED.

Proof of Thm. 6: Define m = tu(x l , x 2 ). Our condition means that 

$$\int dy^1\, dy^2\, \Theta[y - (y^1 - y^2)m]\, P(V_a(x^1, \rho) = y^1 \mid l')\, P(V_a(x^2, \rho) = y^2 \mid l') =$$

$$\int dy^1\, dy^2\, \Theta[y - (y^1 - y^2)m]\, P(K V_b(x^1, \rho) + h(x^1) = y^1 \mid l)\, P(K V_b(x^2, \rho) + h(x^2) = y^2 \mid l),$$

i.e.,

$$\int dy^1\, dy^2\, \Theta[y - (y^1 - y^2)m]\, [P(V_a(x^1, \rho) = y^1 \mid l')\, P(V_a(x^2, \rho) = y^2 \mid l')]$$

$$= \int dr^1\, dr^2\, \Theta[y - mK(V_b(x^1, r^1) - V_b(x^2, r^2)) - m(h(x^1) - h(x^2))]\, P(r^1, r^2; l)$$

$$= \int dr^1\, dr^2\, \Theta[y/K - m(V_b(x^1, r^1) - V_b(x^2, r^2)) - m(h(x^1) - h(x^2))/K]\, P(r^1, r^2; l)$$

$$= \int dy^1\, dy^2\, \Theta[\{y/K - m(h(x^1) - h(x^2))/K\} - (y^1 - y^2)m]\, P(V_b(x^1, \rho) = y^1 \mid l)\, P(V_b(x^2, \rho) = y^2 \mid l).$$

QED.

Proof of Thm. 7: To prove (i), first marginalize out $y^2$ from the equality
relating $P_{V_a}$ and $P_{K V_b + h}$, and then use the resultant equality between probability
distributions to form an equality concerning the two associated variances of $y^1$.
The resultant formula for K holds for any $x^1$, and therefore it holds under
arbitrary averaging over the $x^1$.

To prove (ii), use the equality relating $P_{V_a}$ and $P_{K V_b + h}$ to relate the expected
values of the difference $(y^1 - y^2)$, evaluated according to the two distributions
$P_{V_a}$ and $P_{V_b}$:

$$\int dr^1\, dr^2\, P(r^1, r^2; l', x^1, x^2)\, [V_a(x^1, r^1) - V_a(x^2, r^2)]$$

$$= h(x^1) - h(x^2) + K \int dr^1\, dr^2\, P(r^1, r^2; l, x^1, x^2)\, [V_b(x^1, r^1) - V_b(x^2, r^2)].$$

Next collect terms to get an expression for $[h(x^2) - h(x^1)]/K$ in terms of expected
values of $V_a$ and $V_b$. Finally plug in the definition of $\Lambda_f$ and evaluate K to verify
our equation for $[h(x^2) - h(x^1)]/K$. QED.


Proof of Thm. 8: To prove (i), note that since $P(r'; l) = P(r'; l')$, and since
$V_a$ and $V_b$ have the same lead utility, $E(V_a; l', x^1) - E(V_a; l', x^2) = E(V_b; l, x^1) -
E(V_b; l, x^2)$. Therefore the drop in learnability means that $\int dx\, f(x)\, \mathrm{Var}(V_a; l', x) <
\int dx\, f(x)\, \mathrm{Var}(V_b; l, x)$. Plugging this into Thm. 7(i) gives the result claimed.

To prove the second part of (ii), for pedagogical clarity define m = ty A (x 1 , x 2 ) 
and write the derivative as 


$$\int dr^1\, dr^2\, P(r^1, r^2; l', x^1, x^2)\, \delta(m[V_a(x^1, r^1) - V_a(x^2, r^2)])$$

$$= \int dr^1\, dr^2\, P(r^1, r^2; l, x^1, x^2)\, \delta(m[K\{V_b(x^1, r^1) - V_b(x^2, r^2)\} + h(x^1) - h(x^2)])$$

$$= K^{-1} \int dr^1\, dr^2\, P(r^1, r^2; l, x^1, x^2)\,
\delta\!\left( m[V_b(x^1, r^1) - V_b(x^2, r^2)] - \frac{m[\Lambda_f(V_b; l, x^1, x^2) - \Lambda_f(V_a; l', x^1, x^2)]}{\sqrt{\int dx\, f(x)\, \mathrm{Var}(V_b; l, x)}} \right)$$

where Thm. 7(ii) was used in the last step. By hypothesis, the difference in
learnabilities equals zero though. This establishes the result claimed.

To prove the first part of (ii), use similar reasoning to write the value of the 
ambiguity at the origin as 


$$\int dr^1\, dr^2\, P(r^1, r^2; l', x^1, x^2)\, \Theta(m[V_a(x^1, r^1) - V_a(x^2, r^2)])$$

$$= \int dr^1\, dr^2\, P(r^1, r^2; l, x^1, x^2)\,
\Theta\!\left( m[V_b(x^1, r^1) - V_b(x^2, r^2)] - \frac{m[\Lambda_f(V_b; l, x^1, x^2) - \Lambda_f(V_a; l', x^1, x^2)]}{\sqrt{\int dx\, f(x)\, \mathrm{Var}(V_b; l, x)}} \right).$$

(iii) is immediate from Thm. 7(i).

Finally, to prove (iv), without loss of generality take K < 1, and use the trick
in (ii) with s* = s to increase K to 1. Doing this reduces the maximal slope of
the associated ambiguity. In addition, it results in a right-shifted version of the
ambiguity $A(V_b; l, x^1, x^2)$. Therefore this reduced maximal slope is the same as
the maximal slope of $A(V_a; l', x^1, x^2)$. QED.

Proof of Coroll. 5: Due to their all obeying Coroll. 4(ii), all utilities share 
the same m, which equals all of their m n s. Write 

®[({y/Ki t .,v*,i t ,v t } - A t)t . iT 2 jX 2 ) - m(V t (x 1 ,r 1 ) - V t (x 2 , r 2 ))]. 


J dr 1 dr 2 P(r* ,r 2 ; /t*,^ 1 ,^ 2 )©^ — m(V*(a: 1 ,r 1 ) — ^‘(x 2 ,?- 2 ))] 


On the other hand. 




= J dr 1 dr 2 P(r l ,r 2 ; Z t , x 1 , x 2 )0[j/-m(V r t (x 1 , r 1 )-^ (a 


*))]• 


By comparing our formulas for the two ambiguities, we see that as long as 

y 

At,t* j2 ,l x 2 ^ 1/ V 1/ 6 

it follows that A(V*(., r), Vi; Z^z 1 ,# 2 ) > A(V*(.,r), F*; Zt*,z\z 2 ). Furthermore, 
by our formulas for algebraic manipulation of iCs, we know that Ki t . $K t t . ,z f ,v t = 

By Thm. 8(iii), this just equal PKi t «y,i t y t = 

Accordingly, L t}t +y*, x i iX 2 is the set of values /3 by which one could multiply 
and still have the desired inequality hold, given the values of D t , x i jX 2 and 
B t x i x 2 . L t yy* is then defined as the set of such multiples for which we can be 
assured that the inequality holds for every (z 1 ,# 2 ) pair. So for every (5 in that 
set, we know that ■(/3V'* , Z t *) has better ambiguity than does (V*,Z t ), for every 
single (x l ,x 2 ) pair. Accordingly, by Coroll, i, it has better expected intelligence 
as well. That means that so long as 0 £ U t L t yy*, it follows that (fiV* , It * ) has 
better expected intelligence than some (V t ,lt)- QED. 


Proof of Thm. 9: By Thm. 2, a utility $U_1$ is factored with respect to $U_2$ for
agent $\rho$ at z iff we can write it as $U_1(z') = \Phi_r(U_2(z'))$ for some r-parameterized
function $\Phi$ whose first partial derivative is positive across all $z' \in \rho(z)$. Any
such function can always be written as $F_r(U_2) - D$ for some function D only
dependent on $\rho(z)$ and some r-parameterized function F whose derivative is
positive. This establishes (i).

To minimize the learnability of $U_1$ given $\Phi$, l, and $U_2$, first note that since
D is independent of x, the numerator in the definition of $\Lambda_f(U_1; l, x^1, x^2)$,
$E(U_1; l, x^1) - E(U_1; l, x^2)$, is independent of the choice of D. So we need only
consider the denominator. Rewrite that denominator as

$$E_{f(x)}[\mathrm{Var}(U_1; l, x)]
= (1/2) \int dx\, f(x) \int dr'\, dr''\, P(r'; l)\, P(r''; l)\, [U_1(x, r') - U_1(x, r'')]^2$$

where we have used the fact that $\mathrm{Var}(A(\tau)) = (1/2) \int dt_1\, dt_2\, P(t_1) P(t_2) [A(t_1) -
A(t_2)]^2$ for any random variable $\tau$ with distribution P.
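
The pairwise form of the variance used here is a standard identity, and a direct numerical check on a small discrete distribution may help make it concrete. The probabilities `p` and values `a` below are arbitrary numbers chosen for the example, with sums standing in for the integrals.

```python
def variance(p, a):
    """Usual variance: sum_i p_i * (a_i - mean)^2."""
    mean = sum(pi * ai for pi, ai in zip(p, a))
    return sum(pi * (ai - mean) ** 2 for pi, ai in zip(p, a))


def pairwise_variance(p, a):
    """Pairwise form: (1/2) * sum_{i,j} p_i * p_j * (a_i - a_j)^2."""
    return 0.5 * sum(
        pi * pj * (ai - aj) ** 2
        for pi, ai in zip(p, a)
        for pj, aj in zip(p, a)
    )


p = [0.1, 0.2, 0.3, 0.4]   # a normalized discrete distribution
a = [3.0, -1.0, 0.5, 2.0]  # values A(t) at each point t

direct = variance(p, a)
pairwise = pairwise_variance(p, a)
```

The two numbers agree up to floating-point rounding, which is the fact the proof uses to turn a variance over contexts into a double integral over $r'$ and $r''$.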

Bring the integral over x inside the other integrals, expand $U_1$, and introduce
the shorthand $D_1(x, r) \equiv F_r(U_2(x, r))$ to get

$$(1/2) \int dr'\, dr''\, P(r'; l)\, P(r''; l) \int dx\, f(x)\, [D_1(x, r') - D_1(x, r'') - (D(r') - D(r''))]^2.$$

The innermost integral is minimized for each $r'$ and $r''$ so long as for each $r'$
and $r''$,

$$D(r') - D(r'') = \int dx\, f(x)\, [D_1(x, r') - D_1(x, r'')].$$

This can be assured by picking $D(r) = E_{f(x)}(D_1(x, r))$ for all r. This establishes
(ii).

Since E{TJi\r,s,x l ) — P(Z7i;r,5,x 2 ) = P(Z7 2 ; r, s,x x ) — P(Z7 2 ;r, s,x 2 ), the 
ambiguity shift in going from U 2 to Ui equals 


{E{Ui]l,x x ) — E(Ui;l,x 2 )) 



j E f CVai(U 2 ;iF) } 

E f (Vax(U i; l,Z)) j ' 


So what we need to do is minimize * 

Now for our choice of D, by the reasoning above,

$$E_f(\mathrm{Var}(U_1; l, x)) = (1/2) \int dr'\, dr''\, P(r'; l)\, P(r''; l)\, \mathrm{Var}_{f(x)}(D_1(x, r') - D_1(x, r'')).$$

Now again use the fact that $\mathrm{Var}(A(\tau)) = (1/2) \int dt_1\, dt_2\, P(t_1) P(t_2) [A(t_1) - A(t_2)]^2$
for any random variable $\tau$ with distribution P and associated function A to
expand the $\mathrm{Var}_f$ into a double integral. Next rearrange terms, and again use that
fact, this time to reduce the integral over $r'$ and $r''$ into a single variance. QED.

Proof of Thm. 10: Any change to z that doesn't move it out of the set $B \cap
\pi'(z)$ doesn't move it out of $B \cap \pi(z)$, since all z in any element of $\pi'$ lie in the
same element of $\pi$. Therefore that change to z doesn't change $\pi(z)$. That means
in turn that it does not change $D_1(CL_{\hat{\pi}}(z))$. So $D_1(CL_{\hat{\pi}}(z))$ can be written
as a function that depends only on $B \cap \pi'(z)$. Therefore it is of the form for the
secondary utility required for the difference utility to be factored with respect
to agent $B \cap \pi'(z)$. QED.

Proof of Thm. 11: Note that $H(CL_{\hat{\kappa}}(z))$ can be written as a function of
$\kappa(z)$, and therefore of $\rho(z)$. Accordingly, expand the numerator term in the
definition of learnability in terms of r to see that it has the same value for
H and WLU.

Write out $WLU_{H,\kappa}(z) = H_1(z) - H_1(CL_{\hat{\kappa}}(z))$ to see that the denominator
term for $\Lambda_f(WLU_{H,\kappa}; l, x^1, x^2)$ is bounded above by

$$\int dx\, f(x) \int dr'\, P(r'; l)\, [H_1(x, r') - H_1(CL_{\hat{\kappa}}(x, r'))]^2.$$

In turn, the greatest possible value of the term in square brackets is M. So that
denominator term is bounded above by M.

Write the denominator term for $\Lambda_f(U; l, x^1, x^2)$ as

$$(1/2) \int dx\, f(x) \int dr'\, dr''\, P(r'; l)\, P(r''; l) \times
[\{H_2(\kappa(r')) - H_2(\kappa(r''))\} + \{H_1(x, r') - H_1(x, r'')\}]^2$$

$$= (1/2) \int dx\, f(x) \int dr'\, dr''\, P(r'; l)\, P(r''; l)\,
\{[H_2(\kappa(r')) - H_2(\kappa(r''))]^2 + [H_1(x, r') - H_1(x, r'')]^2 +
2[H_2(\kappa(r')) - H_2(\kappa(r''))][H_1(x, r') - H_1(x, r'')]\}.$$

The third of the integrals summed in this last expression is bounded below
by

$$-\sqrt{M} \int dx\, f(x) \int dr'\, dr''\, P(r'; l)\, P(r''; l)\, |H_2(\kappa(r')) - H_2(\kappa(r''))|,$$

which in turn is bounded below by $-\sqrt{ML}$, due to concavity of the squaring
operator. The second of our integrals is bounded below by 0. Finally, the first
of these integrals equals L/2 exactly. Combining, the denominator term for
$\Lambda_f(U; l, x^1, x^2)$ is bounded below by $L/2 - \sqrt{ML}$. QED.


D Repeating Coordinates, Multi-Step Games, 
and Constrained Optimization 

Say we have a set of coordinates of $\zeta$, indicated by $\{\zeta^1, \zeta^2, \ldots, \zeta^T\}$, with
associated images of C written as $\{C^1, C^2, \ldots, C^T\}$. Conventionally the index t
is called “time” or the “timestep”. An associated repeating coordinate is a
set $\{A^1, A^2, \ldots, A^T\}$ such that $\forall t$, $A^t(z) = A(\zeta^t(z))$ for some function A whose
domain is given by the union of the ranges of the coordinates $\{\zeta^t\}$, Z. For a
deterministic set $\{\zeta^t\}$ there is a set of single-valued functions $\{E^t\}$, mapping Z
to Z, such that $\zeta^{t+1} = E^t(\zeta^t)$ $\forall t \in \{1, \ldots, T-1\}$. The set is
time-translation-invariant if $E^t$ is the same for all t, and (temporally) invertible if the $E^t$
are all invertible.

In close analogy to conventional game theory nomenclature, we say that
we have a set of players {i}, each consisting of a separate triple of repeating
coordinates $\{\rho_i^t\}$, $\{\xi_i^t\}$, and $\{\nu_i^t\}$, if for each t and i the triple $(\rho_i^t, \xi_i^t, \nu_i^t)$ act as
the context, move, and worldview coordinates, respectively, of an agent. If in
addition T > 1, we sometimes say we have a multi-step game, and identify
each “step” with a different time.

Often we want to consider the intelligences of the players' agents with respect
to some associated sequences of private utilities. We can do this if in addition to
the players we have a repeating coordinate $\{\sigma^t\}$, $s^t$ being the design coordinate
value set by the designer of the collective, and $g_{it}(z) \equiv g_{i,\sigma^t(z)}(z)$ being the
private utility of player i at time t. 55 In this way each player is identified with
a sequence of agents.

A multi-stage game is one in which for every i, $g_{it}$ is the same function
of $z^T \in Z$. A normal-form (version of a multi-stage) game is the system $\zeta^1$
with associated coordinates and set of allowed points $C^1$, where $P(z^1)$ is set
by marginalizing P(z). So in particular, $P(g_{i1}(z^1) = v) = \int dz^T\, P(z^T \mid z^1)\, \delta(v -
g_{iT}(z^T))$. Intuitively, a normal form game is the underlying multi-stage game
“rolled up” into a single stage, that stage being set by the initial joint state of
the players.

If for every i, $g_{it}$ is the same function from $z^t \in Z$ to the reals, then we say
we have an iterated game. More generally, if for each player i all of the $\{g_{it}\}$
are the same discounted sum over $t' \in \{1, \ldots, T\}$ of $R_i(z^{t'})$ for some real-valued
reward function $R_i$ that has domain Z, then each player's agents must try to
predict the future, and we have a repeated game.

Note that conventional full rationality noncooperative game theory of nor- 
mal form games, involving Nash equilibria of the private utilities, is simply the 
analysis of scenarios in which the intelligence of z with respect to each player’s 
private utility, given the context set by the other players’ moves, equals 1. This 
fact suggests many extensions of conventional noncooperative game theory based 
on the formalism of this paper. For example, we can consider games in which
$C \neq \zeta$, i.e., not all joint-moves are possible. Another modification, applicable if
we use the percentile-type of intelligence, is to restrict $d\mu_\rho$ to some limited “set
of moves that player $\rho$ actively considers”. This provides us with the concept
of an “effective Nash equilibrium” at the point z, in the sense that over the set
of moves it has considered, each player has played a best possible move at such
a point. In particular, for moves in a metric space, we could restrict each $d\mu_\rho$
to some infinitesimal neighborhood about z, and thereby define a “local Nash
equilibrium” by having $\rho$'s intelligence with respect to utility equal 1 for each
player $\rho$.

55 An interesting topic is whether for a particular player there is a set of functions $\{U^t(z^t)\}$
such that the values $\{x^t\}$ induce large $N_{\rho^t, U^t}(z^t)$, $\forall t \in \{1, \ldots, T\}$. When there is such a set, it
would seem natural to interpret the player as a set of “agents” with associated private utilities
$\{U^t\}$. However unless we can vary the private utility that the time t “agent” is supposedly
trying to maximize, we have no reason to believe that the value $x^t$ really is set by a learning
algorithm trying to maximize that private utility. (We might have a coordinate akin to the
explicitly non-learning spins in Ex. 1 of [22].) This means that for such an interpretation to be
tested, the private utility must be part of some $\{\sigma^t\}$, so we can set it. Our modifying it must
then induce associated changes in the moves consistent with the supposition that a learning
algorithm is controlling those moves to try to maximize those values of the private utilities,
as discussed in the subsection on the first premise.

More generally, as an alternative to fully rational games, one can define a 
bounded rational game as one in which the intelligences equal some vector e 
whose components need not all equal 1. Many of the theorems of conventional 
game theory can be directly carried over to such bounded-rational games [19] by 
redefining the utility functions of the players. In other words, much of conven- 
tional full rationality game theory applies even to games with bounded rational- 
ity, under the appropriate transformation. This result has strong implications 
for the legitimacy of the common criticism of modern economic theory that its 
assumption of full rationality does not hold in the real world, implications that 
extend significantly beyond the Sonnenschein-Mantel-Debreu equilibrium
aggregate demand theorem [11].

Note also that at any point z that is a Nash equilibrium in the set of the
players' utilities, every player's intelligence with respect to its utility must equal
1. Since that is the maximal value any intelligence can take on, a Nash equilib- 
rium in those utilities is a Pareto optimal point in the values of the associated 
intelligences (for the simple reason that no deviation from such a z can raise any 
of the intelligences). Conversely, if there exists at least one Nash equilibrium in 
the player utilities, then there is not a Pareto optimal point in the values of the 
associated intelligences that is not a Nash equilibrium. 
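
The link between Nash equilibria and intelligence equal to 1 can be illustrated on a small matrix game. The sketch below rests on assumptions that are not spelled out in this appendix: intelligence is taken to be the percentile-type fraction of a player's alternative moves (under a uniform measure, counting ties) whose utility does not exceed that of the move actually played, and the payoff table is a textbook prisoner's dilemma chosen for the example.

```python
def intelligence(utility, joint, player, moves):
    """Fraction of the player's moves doing no better than the move actually played."""
    actual = utility(joint)

    def swap(move):
        # Replace only this player's move, holding the context (other moves) fixed.
        alt = list(joint)
        alt[player] = move
        return tuple(alt)

    return sum(utility(swap(m)) <= actual for m in moves) / len(moves)


# Prisoner's dilemma payoffs: 0 = cooperate, 1 = defect.
payoff = {
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}
moves = [0, 1]


def intelligences(joint):
    """Intelligence of each player with respect to its own private utility."""
    return [
        intelligence(lambda z, i=i: payoff[z][i], joint, i, moves)
        for i in range(2)
    ]


# Joint moves at which every player's intelligence equals 1.
nash = [z for z in payoff if all(n == 1.0 for n in intelligences(z))]
```

Under these assumptions the only joint move with all intelligences equal to 1 is mutual defection, which is also the game's unique Nash equilibrium; at mutual cooperation each player's intelligence drops to 0.5.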

Note that the moves of some player i may directly set the private utility
functions of the agent(s) of some other player $i'$ in a multi-step game. In
particular, the private utilities of i's agents might explicitly involve inferences about
the effect on $P(G \mid s^t)$ of various possible choices of $g_{i'}$. Loosely speaking,
when an agent of player i changes the learning algorithm, move variable,
worldview variable, and/or private utilities of (the agents of) other players, and does
so gradually, based on considerations of how to improve $P(G \mid s^t)$, we refer
to its learning algorithm as engaging in macrolearning; that agent's moves
constitute on-line modification of s to try to improve G. We contrast this with
microlearning, in which one agent's moves are not viewed as directly setting
other agents' private utility functions, in loose analogy with the distinction
between macroeconomics and microeconomics. 56

In any kind of game, each agent only works to (try to) maximize its current
private utility. 57 However $g_{it}$ will not be mutually factored (with respect to
moves $x^t$) with either the utilities or with G, in general. Intuitively, moves
that improve the current private utility may hurt the future one, and may even
(due to those future effects) hurt G. (See [1] for an example of this.) In repeated
games where G is itself a discounted sum, appropriate coupling of the reward
function of the player with that of G can ensure factoredness of those two
reward functions. However in iterated games (which for example are those that
arise with the Boltzmann learning algorithms considered in [17]) there is no
such assurance. And even for repeated games with discounted sum G's, simply
having each of the player's rewards be factored with respect to the associated
reward of G does not ensure that the player's full private utility is factored with
respect to G. 58

56 In general, we wish to optimize G subject to the communication restrictions at hand.
When the nodes are agents, such restrictions apply to the argument lists of their private
utilities. More generally though, the nodes can communicate with each other in ways other
than via their private utilities. Indeed, part of macrolearning in the broadest sense of the
term is modifying such extra-utility “signaling” and “bargaining” among the nodes, to try to
improve performance of the overall system. None of these “low level” issues are addressed in
this paper.

57 Formally, the first premise applies to moves and private utilities that share the same time,
since here the full agent is defined for a single time.

Another subtlety arises if there is randomness in the dynamics of the system
at times $t' > t$, and we are considering a utility function at time t that depends
on components of z other than $z^t$ (e.g., we have a multi-stage game). The
problem is that in general we require utility functions to be expressible as a
single-valued function of the move and context of any agent. So in particular
our utility must be such a function of $(x^t, r^t)$, despite the stochasticity at times
$t' > t$.

One way around this problem is not to cast the problem as a multi-step game,
and instead have contexts explicitly include future states of the system. We can
keep the game-theoretic structure though if we have z specify the state of the
pseudo-random number generator underlying the stochasticity, and then have
that state be included in $r^t$. This encapsulates the stochastic dynamics within
a deterministic system. Another approach is to recast utilities and associated
intelligences in terms of partial worldpoints $z^{t' \le t}$ rather than full worldpoints
that include time to the future of t. As an example, starting with a conventional
utility U, we could define a new utility $\bar{U}(z) \equiv E(U \mid z^{t' \le t})$. Since $\bar{U}(z') = \bar{U}(z)$
if $(z')^{t' \le t} = z^{t' \le t}$, $N_{\rho,\bar{U}}(z)$ only judges z by the quality of its components for
times previous to the future.
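
The first workaround, folding the pseudo-random generator's state into the context, can be sketched as follows. Everything here is a hypothetical toy: the dynamics in `rollout`, the layout of `context` as (initial state, later moves, seed), and the utility itself are assumptions made for illustration. The only point is that once the seed is part of the context, the utility becomes a single-valued function of (move, context).

```python
import random


def rollout(initial_state, moves, seed):
    """Noisy dynamics whose randomness is fully determined by the seed."""
    rng = random.Random(seed)  # generator state is reconstructed from the seed
    state = initial_state
    for move in moves:
        state += move + rng.choice([-1, 0, 1])
    return state


def utility(move, context):
    """Single-valued in (move, context) because the seed lives in the context."""
    initial_state, later_moves, seed = context
    return -abs(rollout(initial_state, [move, *later_moves], seed))


context = (0, (1, -1, 2), 12345)
first = utility(1, context)
second = utility(1, context)
```

Calling `utility` twice with the same move and context gives the same value, whereas drawing fresh noise on every call would not define a function of (move, context) at all.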

There is another subtlety that can arise even in deterministic games, from the
general requirement that any move can accompany any context. The problem
is that this requirement is, on the face of it, incompatible with constrained
optimization problems, in which typically for any moment t, C forbids some of
the potential joint-states of the agents at that time. The simplest way around
this difficulty, when it is feasible, is simply to choose a different set of move
coordinates for the agents, one in which the constraints do not restrict the
agent's moves. Another way around this difficulty is to transform the problem
by means of a function that maps any (unconstrained) pair (x, r) to an allowed
(constrained) joint-state of all agents, which in turn is what is used to determine
utility values.

58 In practice factoredness of reward functions often results in approximate factoredness
of associated utilities if t is large enough so that the system has started to settle toward a
Nash equilibrium among the players' reward functions. In turn, such settling toward a Nash
equilibrium is expedited if we set s to give a good term 3 in the “reward utility version” of
the central equation, in which all utilities are replaced by the associated reward functions.

For the more general scenario where factoredness of reward functions does not suffice, one
can guarantee factoredness of the utilities by using reward functions set via “effect sets”. As
discussed in the discussion of the WLU, such reward functions can ensure factoredness by (in
essence) overcompensating for all possible future effects on G of a player's current action. A
more nuanced approach is investigated in [20].

No such function is needed however if the constrained optimization problem
can be cast as traversing the nodes in a graph with fixed fan-out, so that the
constraints don't apply to the moves directly. To see this, first consider an
iterated game with an “environment” repeating coordinate $\{q^t\}$. Say that the
game is a Markovian control problem with N players, i.e., a multi-stage game
where G(z) only depends on the value $q^T$ and

$$P(q^t \mid q^{t-1}, x_1^{t-1}, \ldots, x_N^{t-1}) = v(q^t, q^{t-1}, x_1^{t-1}, x_2^{t-1}, \ldots, x_N^{t-1})$$

where v is independent of $t \in \{1, \ldots, T-1\}$. 59

For a graph-traversal version of this problem the dynamics is single-valued,
so we can write $\nu(q', q, x_1, \ldots, x_N) = \delta(q' - x_1 x_2 \cdots x_N(q))$ for some function
of $q$ and $(x_1, \ldots, x_N)$ that is written as $x_1 x_2 \cdots x_N(q)$. (For uncountable $q$, this
is a continuum-limit graph.) So any constraints on optimizing G (that is, on finding
the optimal node q in the graph) are reflected in the graph’s topology.

This kind of problem is a (fixed fan-out) undirected-graph-traversal problem
if in addition the values of each $x_i$ form a group, in the following sense
(writing $Q$ for the set of possible $q$):

i) $\forall q \in Q$, $\exists!\,(I_1, I_2, \ldots, I_N) \in \{(x_1, x_2, \ldots, x_N)\}$ such that $I_1 I_2 \cdots I_N(q) = q$.

ii) $\forall q \in Q$, $\forall (x'_1, x'_2, \ldots, x'_N) \in \{(x_1, x_2, \ldots, x_N)\}$, $\exists!\,((x'_1)^{-1}, (x'_2)^{-1}, \ldots, (x'_N)^{-1}) \in \{(x_1, x_2, \ldots, x_N)\}$ such that $(x'_1)^{-1} (x'_2)^{-1} \cdots (x'_N)^{-1} x'_1 x'_2 \cdots x'_N(q) = q$.

In practice, search across such a graph is easiest when the identity and inverse
elements of each group of moves are independent of q , and G does not vary too 
quickly as one traverses the graph. 
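
The two group conditions can be checked by brute force on a small example. The following is a minimal sketch, not taken from the original analysis: it assumes a hypothetical graph whose nodes are N-tuples of integers mod K, with player i's move shifting the i'th entry of the node. The names `K`, `N`, `apply_joint_move`, `identities` and `inverses` are illustrative only.

```python
from itertools import product

K = 3  # number of moves available to each player (hypothetical)
N = 2  # number of players (hypothetical)

def apply_joint_move(x, q):
    """Node reached from node q when player i shifts entry i of q by x[i], mod K."""
    return tuple((q_i + x_i) % K for q_i, x_i in zip(q, x))

nodes = list(product(range(K), repeat=N))
joint_moves = list(product(range(K), repeat=N))  # fixed fan-out: K**N edges per node

def identities(q):
    """All joint moves that leave node q unchanged (condition i wants exactly one)."""
    return [x for x in joint_moves if apply_joint_move(x, q) == q]

def inverses(x, q):
    """All joint moves that undo joint move x applied at q (condition ii wants exactly one)."""
    q_next = apply_joint_move(x, q)
    return [y for y in joint_moves if apply_joint_move(y, q_next) == q]

has_unique_identity = all(len(identities(q)) == 1 for q in nodes)
has_unique_inverse = all(
    len(inverses(x, q)) == 1 for q in nodes for x in joint_moves
)
```

In this toy graph the identity and inverse joint moves are the same at every node, which is the situation described above as the easiest one to search.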

Finally, as an illustration of off-equilibrium benefits of factoredness, consider 
the case where $\zeta$ is a Euclidean space with an iterated game structure where
every $\rho^t(z)$ is a manifold and all of those manifolds are mutually orthogonal
everywhere on $C$. Presume that all utilities are analytic. Then for small enough
step sizes, having each player run a gradient ascent on its reward function must 
result in an increase in G, for a factored system. (However such a gradient ascent 
may progressively decrease the values of some players’ utilities.) 

To see why G must increase under gradient ascent, first, as a notational
matter, when $M$ is a manifold embedded in $\zeta$ define $\nabla_M F(z)$ to be the gradient
of $F$ in some coordinate system for $M$, expressed as a vector in $\zeta$. Let $J_{\rho^t}$ be the
tangent plane to $\rho^t(z)$ at $z$. Then if $G$ is factored with respect to $g_{\rho^t}$, $\nabla_{J_{\rho^t}}(g_{\rho^t}(z))$
must be parallel to $\nabla_{J_{\rho^t}}(G(z))$. (If there were any discrepancy between the
directions of those two gradients, there would be a direction within $\rho^t(z)$ in which
one could move $z$ and in so doing end up increasing $g_{\rho^t}$ but decreasing $G$.) So
the dot product between those gradients is non-negative, and therefore changing

59 Note that in this problem, G is not a direct function of the players’ joint-move at
any time. Rather the joint-move specifies the incremental change to another variable, the
environment, which is what directly sets the value of G. See App. E on gradient ascent over
categorical variables.

$z \to z + |\alpha|\,\nabla_{J_{\rho^t}}(g_{\rho^t}(z))$ for infinitesimal $\alpha$ cannot decrease $G(z)$. Generalizing,
note that for any utility $U$ the gradients $\nabla_{J_{\rho^t}}(U)$ (one for each $\rho^t$) are mutually
orthogonal, since the underlying manifolds are. Therefore having all those dot
products be non-negative means that moving $z$ an infinitesimal amount in $\zeta$ in
the direction with components in each plane $J_{\rho^t}$ given by $\nabla_{J_{\rho^t}}(g_{\rho^t}(z))$ cannot
decrease $G(z)$. So gradient ascent works for factored systems.
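
A small numerical illustration of this argument follows. It is only a sketch under assumptions that are not in the original: two players, each controlling one coordinate of a point in the plane; a hypothetical concave quadratic world utility `G`; and private utilities `g1` and `g2` that equal `G` plus a term depending only on the other player's coordinate, so that each is factored with respect to `G`. Gradients are taken by central finite differences.

```python
def G(z):
    """Hypothetical world utility: a concave quadratic in the two players' coordinates."""
    x, y = z
    return -(x - 1.0) ** 2 - (y + 2.0) ** 2 - 0.5 * x * y

def g1(z):
    """Private utility of player 1 (controls z[0]); the extra term ignores z[0]."""
    return G(z) - 3.0 * z[1] ** 2

def g2(z):
    """Private utility of player 2 (controls z[1]); the extra term ignores z[1]."""
    return G(z) + 50.0 * z[0]

def partial(f, z, k, eps=1e-6):
    """Central finite-difference estimate of the partial derivative of f along coordinate k."""
    z_plus, z_minus = list(z), list(z)
    z_plus[k] += eps
    z_minus[k] -= eps
    return (f(z_plus) - f(z_minus)) / (2.0 * eps)

def joint_step(z, alpha=1e-2):
    """Each player moves its own coordinate up the gradient of its own private utility."""
    return [z[0] + alpha * partial(g1, z, 0), z[1] + alpha * partial(g2, z, 1)]

z_start = [3.0, 3.0]
z = list(z_start)
G_history = [G(z)]
for _ in range(200):
    z = joint_step(z)
    G_history.append(G(z))
z_final = z
```

With this choice of step size the recorded values of `G` never decrease, while `g2` ends lower than it started, in line with the parenthetical remark that some players' utilities may fall during the ascent.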

Similarly, fix $t$, and consider two worldpoints $z'$ and $z''$ that are infinitesimally
close, but potentially differ for every player. Then it may be that for no
player $\rho$ does $\rho^t(z') = \rho^t(z'')$: every player sees a different set of the moves of
its opponents at $z'$ and $z''$. Nonetheless, again using non-negativity of the dot
products, the system’s being factored means that there must be at least one
player $\rho$ for which $\mathrm{sgn}[G(z') - G(z'')] = \mathrm{sgn}[g_{\rho^t}(z') - g_{\rho^t}(z'')]$. (Compare to
Thm. 1.)

E Example — gradient ascent for categorical 
variables 

This example illustrates the many connections between traditional search tech- 
niques like gradient ascent and simulated annealing on the one hand, and the 
use of a collective of agents to maximize a world utility on the other. 

Say we have a Cartesian product space $M = M_1 \times M_2 \times \cdots \times M_L$, where each
$M_i$ is a space of $|M_i|$ categorical (i.e., symbolic, non-numeric) variables. Write
a generic element of $M$ as $m$, having components $m_i$, $i \in \{1, \ldots, L\}$. Consider a
function $h : M \to \mathbb{R}$ that we want to maximize. Because $M$ is not a Euclidean
space, we cannot use conventional gradient ascent to do this. However we can
still use gradient ascent if we transform to a probability space.

To see how, take $\zeta$ to be the space of Euclidean vectors comprising the Cartesian
product $S^{|M_1|} \times S^{|M_2|} \times \cdots \times S^{|M_L|}$, where each $S^{|M_i|}$ is the $|M_i|$-dimensional
unit simplex. Define the function $R(z) = \sum_m \left(\prod_{i=1}^{L} z_{i,m_i}\right) \times h(m)$. The product
$z_P(m) = \prod_{i=1}^{L} z_{i,m_i}$ gives a (product) probability distribution over the space
of possible $m \in M$. (Intuitively, $z_{i,j} = P(m_i = j)$.) Accordingly, $R(z)$ is the
expected value of $h$, evaluated according to the distribution $z_P$.
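
For a small space this expected value can be computed by direct enumeration. The sketch below is illustrative only: the two categorical variables, their labels, and the table of values standing in for h are all made up, and a distribution z is represented as one dictionary of probabilities per variable.

```python
from itertools import product
from math import prod, isclose

# Hypothetical categorical variables and a made-up table standing in for h.
values = [("red", "green"), ("small", "medium", "large")]
h_table = {
    ("red", "small"): 1.0, ("red", "medium"): 4.0, ("red", "large"): 2.0,
    ("green", "small"): 3.0, ("green", "medium"): 0.0, ("green", "large"): 5.0,
}

def h(m):
    return h_table[m]

def R(z):
    """Expected value of h when each m_i is drawn independently with probabilities z[i]."""
    total = 0.0
    for m in product(*values):
        weight = prod(z[i][m_i] for i, m_i in enumerate(m))
        total += weight * h(m)
    return total

# Uniform product distribution: R is the plain average of h.
uniform = [{v: 1.0 / len(vs) for v in vs} for vs in values]

# Delta function about the maximizer of h: R equals the maximum of h.
m_star = max(product(*values), key=h)
z_star = [
    {v: 1.0 if v == m_star[i] else 0.0 for v in vs} for i, vs in enumerate(values)
]
```

The enumeration visits every element of M, so it is practical only for tiny examples; it is meant to make the definition concrete rather than to be used for search.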

Define $m^* = \arg\max_m h(m)$. Then

$$\arg\max_z R(z) = \{\delta(z_{1,1} - 0), \delta(z_{1,2} - 0), \ldots, \delta(z_{1,m^*_1} - 1), \ldots, \delta(z_{1,|M_1|} - 0);\ \ldots, \delta(z_{2,m^*_2} - 1), \ldots;\ \ldots, \delta(z_{L,m^*_L} - 1), \ldots\},$$

i.e., the $z$ that maximizes $R$, $z^*$, is a Kronecker delta function about the $m$
that maximizes $h$. However unlike $m$, $z$ lives in (a subset of) a Euclidean space.
So if we make sure to always project $\nabla R(z)$ onto $S^{|M_1|} \times S^{|M_2|} \times \cdots \times S^{|M_L|}$,
the space of allowed $z$, we can use gradient ascent over $z$ values to climb $R$
and thereby maximize $h$. Intuitively, as opposed to conventional gradient ascent
over the variable of direct interest (something that is meaningless for categorical
variables), here we are performing gradient ascent over an auxiliary variable, and
in that way maximizing the function of the variable of direct interest. 60

Note that $R$ is a multilinear function over the (sub)vector spaces $\{S^{|M_i|}\}$,
and its maximum must lie at a vertex of that space. There are $|M_i|$ components
of the gradient of $R$ for each variable $i$, giving $\sum_{i=1}^{L} |M_i|$ components altogether.
The value of the component corresponding to the $j$'th possible value of $M_i$ is
given by the expected value of $h$ conditioned on $m_i = j$. So calculating $\nabla R(z)$
means calculating $\sum_{i=1}^{L} |M_i|$ separate expectation values. Furthermore, at $z^*$,
every component of the gradient has the same value, namely $h(m^*)$, and at all
other $z$ the value of every component of the gradient is bounded above by
$h(m^*)$. 61
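
The identification of gradient components with conditional expectations can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-variable space with labels encoded as integers and a made-up table for h; it compares each conditional expectation with a central finite-difference derivative of R, which is accurate here because R is linear in each individual component.

```python
from itertools import product
from math import prod, isclose

values = [(0, 1), (0, 1, 2)]                    # hypothetical categorical labels
H_TABLE = [[1.0, 4.0, 2.0], [3.0, 0.0, 5.0]]    # made-up values of h

def h(m):
    return H_TABLE[m[0]][m[1]]

def R(z):
    """Multilinear function: sum over m of prod_i z[i][m_i] times h(m)."""
    return sum(prod(z[i][m_i] for i, m_i in enumerate(m)) * h(m)
               for m in product(*values))

def conditional_expectation(z, i, j):
    """Expected value of h given m_i = j, the other variables being drawn from z."""
    total = 0.0
    for m in product(*values):
        if m[i] != j:
            continue
        weight = prod(z[k][m[k]] for k in range(len(values)) if k != i)
        total += weight * h(m)
    return total

def numerical_partial(z, i, j, eps=1e-6):
    """Central finite-difference derivative of R with respect to component (i, j)."""
    z_plus = [list(row) for row in z]
    z_minus = [list(row) for row in z]
    z_plus[i][j] += eps
    z_minus[i][j] -= eps
    return (R(z_plus) - R(z_minus)) / (2.0 * eps)

z = [[0.3, 0.7], [0.2, 0.5, 0.3]]
gradient = [[conditional_expectation(z, i, j) for j in values[i]]
            for i in range(len(values))]
h_max = max(h(m) for m in product(*values))
```

Here `gradient` has one entry per possible value of each variable, and none of its entries exceeds `h_max`, as the bound in the text requires.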

Unfortunately, calculating $\nabla R(z)$ exactly is prohibitively difficult for large
spaces. However we can readily estimate the components of the gradient instead
by recasting it as a technique for improving world utility in a collective. Define
$G(z \in C) = R(z^T \in C^T)$, where $z$ is the history of joint states of a set of agents
over a sequence of $T$ steps in an iterated game, $z^t$ being the state at step $t$
of the game (see App. D). Define $z_i^t$ as the vector given by projecting $z^t$ onto
the $i$'th simplex $S^{|M_i|}$, i.e., the time-$t$ value of the vector $(z_{i,1}, z_{i,2}, \ldots, z_{i,|M_i|})$.
Have all $LT$ of the Cartesian product variables $z_1^t \times z_2^t \times \cdots \times z_{i-1}^t \times z_{i+1}^t \times \cdots \times z_L^t$
be (the value of) a generalized agent coordinate $\rho_i^t$, $x_i^t = z_i^t$ being the value of
the associated move. So for every agent, $G$ is a single-valued function of that
agent’s move and its context, as required. 62

The dynamical restrictions coupling all these distributions give us $C$. To
design that dynamics, note that even though $R(z^t)$ is in no sense a stochastic
function of $z^t$, because of the functional form of its dependence on the agents’ moves
we can use Monte Carlo-like techniques to estimate various aspects of $R(z^t)$. In
particular, we can estimate its gradient this way, and then have the dynamics use
that information to increase $R$’s value from one timestep to the next, hopefully

60 By our choice of R , here we are only considering distributions over M that have all L of 
the variables statistically independent. Doing so exponentially reduces the dimension of the 
space over which we perform the gradient ascent, compared to allowing arbitrary distributions 
over M. However there may be other restrictions on the allowed distribution that result in
even better performance. In the translation of the gradient ascent of R(z) into a collective
discussed below, such alternative stochastic forms of the distribution over m would correspond
to having agents each of whose moves concerns more than one of the $m_i$ at once.

61 To establish the first claim, simply note that $z^*$ is a delta function. To establish the
second, note that the gradient component $E(R \mid m_i = j)$ is just the expected value of $R$
under a different distribution, $z'$, where $z'$ and $z$ are equal for all components not involving
$M_i$, but $z'$ has a delta function for those components. Since expected $R$ under any distribution
is bounded above by $R(z^*)$, it must be for $z'$. Accordingly, each of the components of the
gradient is bounded above by $h(m^*)$, which establishes the claim.

62 Strictly speaking, we need to encode in either $r_i^t$ or $x_i^t$ the other information specifying
the full history, e.g., the values of $z^{t'}$ for $t' < t$. Otherwise that pair of coordinates do not
form a complement pair. For completeness, we can choose to encapsulate all such information
in $r_i^t$, as the current value of the seed of an invertible random-number generator used for the
stochastic sampling that drives the dynamics (see below). None of the analysis presented here
depends on this choice though.

reaching the maximum by time $T$ (in which case we have ensured that $G$ is
maximized).

More precisely, at the end of each step each agent $(i, t)$ independently
samples its distribution $z_i^t$ to choose one of its actions $m_i^t \in M_i$. That set of $L$
samples gives us a full vector $m^t$. Next, we evaluate a function of $m^t$, indexed by
$(i, t)$, whose expectation (according to $z^t$) is the private utility for that agent.
(Note that the joint-action $m^t$ is not the joint-move of the agents at time $t$.
That is $z^t$.)

Combining that function’s value with other information (e.g., the similar
values for $i$ for some times $t' < t$) provides us a training set for that agent
controlling variable $i$. This training set constitutes the worldview for agent
$(i, t+1)$, $n_i^{t+1}$, and is used by the learning algorithm of agent $(i, t+1)$ to form
a new $z_i^{t+1}$. This is done by all $L$ agents, giving us a $z^{t+1}$, and the process
repeats. 63

This dynamics produces a sequence of points $\{m^t\}$ in concert with a sequence
of distributions $\{z^t\}$, which (if we properly choose the private utilities,
learning algorithms used to update the $z_i$, etc.) will settle to $m^*$ and $\delta(m - m^*)$,
respectively. As an example, for all $i$ have the function evaluated at $m^t$ be
$h(m^t)$, so that the private utility of each agent $(i, t)$ is $R(z^t)$. Have the associated
training set for $(i, t)$ be a set of averages of $h(m)$, one average for each
of the possible $m_i$. Have the average for choice $j \in M_i$ be formed by summing
the previously recorded $h(m)$ values that accompanied each instance where $m_i$
equalled $j$, where the sum is typically weighted to reflect how long ago each
of those values was recorded. So each of the $|M_i|$ components of $n_i^t$ is nothing
other than a (pseudo) Monte Carlo estimate of the components for variable $M_i$
of the gradient of $R(z)$ at the beginning of timestep $t$. 64 In other words, they are
estimates of the components of the gradient of the private utility at the current
joint-move.

Accordingly, let the learning algorithm for each agent $(i, t+1)$ be the
following update rule:

$$z_i^{t+1} = z_i^t + \alpha \left[ n_i^{t+1} - \frac{n_i^{t+1} \cdot \vec{1}}{|M_i|}\, \vec{1} \right],$$

where the term in square brackets is the projection of $n_i^{t+1}$ onto its unit simplex
$S^{|M_i|}$, the vector $\vec{1} = (1, 1, \ldots)$ being normal to that simplex. To keep $z$ in its unit
simplex, have $\alpha$ shrink the shorter the distance along $n_i^{t+1}$ from $z_i^t$ to the edge of
that associated simplex. The result is that each variable in the collective
performs a Monte Carlo version of gradient ascent on $G$ and therefore on $h$.
Moreover, the learning algorithm is a reasonable choice for an agent $i$ trying to
63 A faster version of this process has all of the agents at a given time share the same m
rather than each use a new sample of $z^t$. This can introduce extra correlations between the
moves of the agents though, which may violate our assumption of statistical independence
among the $\{M_i\}$.

64 It would be exactly Monte Carlo if not for the steps updating the $\{z^{t' < t}\}$. It is to account
for that updating that the data going into the training set is aged.

modify its move $z_i^t$ to increase its private utility. Accordingly, we would expect
it to obey the first premise. 65
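
The whole procedure (sample, record aged averages, project, step) can be sketched in a few lines. This is a minimal illustration under assumptions of our own rather than the exact scheme analyzed here: categorical values are encoded as integers, the aging of the training data is done with an exponentially weighted running average (the parameter `aging`), the averages start at zero, and the step size is capped at half the distance to the nearest face of the simplex. The names `monte_carlo_ascent` and `project_to_simplex_plane` are hypothetical.

```python
import random

def project_to_simplex_plane(n):
    """Remove from n its component along (1, 1, ..., 1), the normal to the simplex."""
    mean = sum(n) / len(n)
    return [n_j - mean for n_j in n]

def monte_carlo_ascent(h, sizes, steps=2000, alpha=0.05, aging=0.1, seed=0):
    """Return the agents' distributions after `steps` rounds of sampled gradient ascent."""
    rng = random.Random(seed)
    z = [[1.0 / size] * size for size in sizes]   # every agent starts uniform
    n = [[0.0] * size for size in sizes]          # aged averages of h (assumed start: 0)
    for _ in range(steps):
        # Each agent independently samples a value from its own distribution.
        m = tuple(rng.choices(range(size), weights=z_i)[0]
                  for size, z_i in zip(sizes, z))
        value = h(m)
        # Aged average of h for the value each agent just played.
        for i, m_i in enumerate(m):
            n[i][m_i] += aging * (value - n[i][m_i])
        # Projected step, shrunk so that no probability can become negative.
        for i, z_i in enumerate(z):
            direction = project_to_simplex_plane(n[i])
            limits = [z_ij / -d_j for z_ij, d_j in zip(z_i, direction) if d_j < 0]
            step = min([alpha] + [0.5 * limit for limit in limits])
            z[i] = [z_ij + step * d_j for z_ij, d_j in zip(z_i, direction)]
    return z

H_TABLE = [[1.0, 4.0, 2.0], [3.0, 0.0, 5.0]]      # made-up values of h

def h_example(m):
    return H_TABLE[m[0]][m[1]]

z_final = monte_carlo_ascent(h_example, sizes=[2, 3])
```

Because the projected direction sums to zero and the step is capped, each agent's vector stays a valid probability distribution throughout. Whether the run settles on the maximizer of h depends on the step size, the aging rate and the random seed, so no particular outcome is claimed here.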

Note that maximizing G is just a problem in design of collectives. This 
suggests many modifications of the scheme outlined above. In particular, one 
might try many other learning algorithms besides Monte Carlo gradient ascent 
to try to find the z that maximizes G. For example, in a Boltzmann learning 
algorithm, each $z_i^t$ is given by a Gibbs distribution over the $|M_i|$ possible values
of its variable, with the $|M_i|$ “energies” going into that distribution given by
the components of $n_i^t$. Using the sampling scheme with this distribution may be
better than gradient ascent if the tendency of the latter to get trapped circling 
local maxima is a concern (say due to the inaccuracy inherent in the Monte 
Carlo estimating of that gradient). Similarly, one can use many private utilities 
besides R , in particular ones that try to exploit the first premise. Moreover, 
all such approaches can be used even if the G and the z’s are not an expected 
utility and associated probabilities over categorical spaces, respectively. The 
idea of inserting learning agents into a search problem to recast it as a problem 
in the design of collectives is much more general. 
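
As a concrete illustration of the Boltzmann alternative, the sketch below turns an agent's vector of utility estimates into a Gibbs distribution. The sign convention (a higher estimate gets a higher probability, since utility is being maximized) and the `temperature` parameter are assumptions made for this example, and `gibbs_distribution` is a hypothetical name.

```python
import math

def gibbs_distribution(estimates, temperature=1.0):
    """Probabilities proportional to exp(estimate / temperature)."""
    top = max(estimates)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp((e - top) / temperature) for e in estimates]
    total = sum(weights)
    return [w / total for w in weights]

warm = gibbs_distribution([1.0, 2.0], temperature=1.0)
cold = gibbs_distribution([1.0, 2.0], temperature=0.1)
flat = gibbs_distribution([3.0, 3.0, 3.0])
```

Lowering the temperature concentrates the distribution on the value with the largest estimate, while equal estimates give a uniform distribution. Unlike the gradient step, this rule never needs a projection, because its output is a probability vector by construction.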

As an example, return to the gradient ascent learning algorithm, and consider
replacing $h(m)$ with some $h^*(m)$ that is factored with respect to $h$ for
variable $i$. This will result in a new $R$, $R^*$. The partial derivatives of $R^*$ with
respect to the $|M_i|$ components associated with the value of variable $i$ equal
the corresponding derivatives of $R$, up to an overall additive term that is independent
of $m_i$. Accordingly, if we set $z_i$ to maximize $R^*$ rather than $R$, while
having all other coordinates still maximize $R$, we will arrive at the exact same
optimizing distribution over $m$.

Extending this, we can have each coordinate use an associated $R^*$ based on
an $h^*$ that is factored for that coordinate, and it will still be the case that if
each $z_i$ is set to maximize the associated $R^*$ we end up with the same delta
function over $m$ as if all coordinates were set to maximize $R$. However there is
one crucial way that use of $R^*$’s differs from uniform use of $R$. This arises from the fact
that rather than ascending the exact gradient, we are ascending a Monte Carlo
estimate of it. That estimation necessarily introduces noise into the ascent. If
we can minimize that noise, the ascent should be much quicker. This in fact is
exactly what is done when we choose the $h^*$’s to each have as small ambiguity as
possible. 66

65 Note that the updates are invariant with respect to translations upward or downward of
the function h, since such a translation of h induces an identical translation in R and therefore
in $n_i^t$. Similarly, so long as there are at least two $j$ for which the associated $n_{i,j}^{t+1}$ have different
values, $z_i^{t+1} \neq z_i^t$; the updating never halts. This reflects the fact that there are no local
maxima.

66 There are other ways of affecting ambiguity besides the choice of private utility of course, 
and they have to be traded off against other factors in general. As an example, optimizing the step
sizes of the agents depends on associated ambiguities. If the stepsizes used by agents other 
than i are too big, then the gradient estimate for coordinate i will be a poor approximation 
to the true direction of maximal ascent. To see this, note that if the stepsizes used by agents 
other than i are too big, then the actual context r for agent i at timestep t + 1 will differ 
significantly from the r at the timestep t. However it is that latter r that determines the value

From this perspective, the idea of casting a search problem as a problem in
design of collectives can be motivated as a way to extend gradient ascent so it
can be used with categorical variables, by transforming the search to be over
a numeric space. Furthermore, even if the underlying space is numeric, casting
the search problem as a problem in design of collectives has the advantage over
gradient ascent that it naturally allows for large jumps in that underlying space.
Whether the original space is categorical or numeric, the recasting also has the
advantage that it allows the search to be decomposed into a set of parallel searches
(one for each agent). If desired, those parallel searches can then be implemented
on a parallel computer.

More generally, there is nothing about this decomposition that restricts 
its use to cases where the original global search algorithm is gradient ascent. 
So in particular, the decomposition can be used directly over a categorical 
space, without first transforming the search to a numeric space. Moreover, the 
search/learning algorithms of the individual agents in the decomposition need 
not be direct analogues of the original global search procedure. So in particular, 
those individual algorithms need not restrict their agents to only change their 
states by an infinitesimally small amount, as in gradient ascent. All of these ex- 
tra capabilities flow from recasting the search problem as a design of collectives 
problem. 

Another modification of vanilla gradient ascent dynamics follows from notic- 
ing we are only estimating the gradient of R, rather than evaluating it exactly, 
and that the estimation is a variant of Monte Carlo. These observations make it 
natural to modify gradient ascent dynamics by inserting a simulated-annealing- 
style keep/reject procedure at the end of every timestep. However we cannot do 
the naive thing, and run that keep/reject procedure on the pair of (the value
of $R(z^t)$ before timestep $t$’s modification to $z^t$), and (that value of $R$ after the
modification). This is because we can no more evaluate $R$ exactly than we can
its gradient. However we do know what the value of h is for the starting m of 
timestep t and for the new m generated in that timestep. So we can run the 
keep/reject procedure based on those two values of h. 
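
A keep/reject test of this kind might look as follows. The text only specifies a "simulated-annealing-style" procedure, so the particular Metropolis acceptance rule used here, the `temperature` parameter and the function name `keep_new_sample` are assumptions made for illustration.

```python
import math
import random

def keep_new_sample(h_old, h_new, temperature, rng=random):
    """Metropolis-style test on the two sampled values of h (assumed rule).

    Improvements are always kept. A worse sample is kept with probability
    exp((h_new - h_old) / temperature), and never when the temperature is zero.
    """
    if h_new >= h_old:
        return True
    if temperature <= 0.0:
        return False
    return rng.random() < math.exp((h_new - h_old) / temperature)
```

When the test returns False, the dynamics would discard the timestep's changes and restore the previous distributions and sample before continuing; a cooling schedule for `temperature` is left unspecified here.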

In fact, we can always insert such a simulated-annealing-style keep/reject
procedure at the end of each timestep, regardless of the private utility function 
and/or learning algorithm. This is exactly what is done in the technique of 
Intelligent Coordinates (IC), sometimes called “Computational Corporations” 
[26]. From the perspective of design of collectives, IC was motivated as a way to 
improve techniques that focus exclusively on terms 2 and 3 in the central equa- 
tion (e.g., by the setting of the private utility). By its insertion of a keep/reject 
procedure, IC boosts the performance of such techniques by leveraging term 1 in 

$n$ at timestep $t + 1$. So having those stepsizes too large means that $P(r \mid n)$ will be broad.
This in turn usually induces broad distributions over agent $i$’s private utility values for each
of its candidate moves. Usually this means that the ambiguity is quite large.

Conversely, if the stepsize of agent i is too small, then it will be slow in increasing the value
of its private utility. So while agent i benefits from having the stepsizes of other agents be as
small as possible, its stepsize cannot be too small. Since this holds for all agents, we have to
trade off the two effects when determining the optimal stepsize.

the central equation while not degrading terms 2 or 3. Another way of viewing
IC is as a variant of a conventional simulated-annealing-style keep/reject search
algorithm. In this variation each searched variable is made “smart”, its exploration
values being the moves of game-playing computer algorithms (agents),
rather than, as in conventional algorithms, random samples of a probability
distribution. 67

As a final example of an approach to optimization suggested by extending 
this gradient ascent example, consider replacing the gradient term with the 
move of a learning agent in the gradient update rule, rather than replacing 
the $z_i^t$ term. There are several subtleties with implementing such an idea in
practice [9]. One is that typically the value of a utility will change with $t$ even
if all the agents freeze their moves with this new approach, since such freezing
means that the agents are still traversing the surface, only in a constant direction.
This contrasts with the typical case where the learning agents set the $\{z_i^t\}$
directly, and can often result in large ambiguities. Nonetheless, especially
in constrained optimization problems like graph traversal, this alternative might
be the approach of choice. (See App. D.) 

F General situation where the second premise 
holds 

We will illustrate a case where f dnP(n | r. | n) = / dnP(n j r, s)P^’^(x j 

n,s), and therefore the second premise holds. 

Consider the integral f d nP(n | r, s)?M ( x | n) arising in the second premise. 
Expand the distribution in terms of H , and for simplicity say that H does not 
depend on n directly. Next suppose that P(n | r, s) is relatively peaked for fixed 
r and s. This provides a scale length of the ambiguity arguments of H, given by
how much they vary as n moves across that peak. Say that H is a slowly varying
function of its arguments on that scale length. (This is particularly reasonable
if ambiguities vary little as one traverses the peak in P(n | r, s).) Under these
circumstances we can pull the integral over n inside the H to operate directly
on the vector of H’s n-dependent arguments, i.e., replace

J &TiP(tI j r, s) | n)id{y^^. n?x t > x-7)}(^') * -^{J dnP(n|r,s)A(^;n,x%xJ)}0*')'. . 

Next, consider each term f dnP(n [ r, s)A(2 J> \ n, x\ x^) appearing inside 
the id. If we expand that ambiguity and pull in the integral over n, we get 

67 An analogue of IC is a well-run human corporation, with G the corporation’s profit, the 
players i identified with the employees, and the associated g^t given by the employees’ compen- 
sation at time t. The corporation is factored if each employee’s compensation directly reflects 
its effect on G. If each compensation package also has good ambiguity, the employees can read- 
ily discern how their behavior affects their compensation. Finally, the exploration/exploitation 
process is analogous to management’s deciding whether to maintain or abandon a particular 
set of decisions by the employees. These similarities are the basis of the name “computational 
corporation”.

expressions of the form j d nP(n \ r,s)P(g J}a (x 1 ,p))P(g PjCr {x 2 ,p)). Now again
assume P(n | r, s) is relatively peaked, this time on the scale of variations in 
P(g p a (x,p)). This allows us to replace 

J &nP{n\r,s)P{g p ^{x 1 ,p))P{g p ^{x 2 ,p)) -> 

J d nP(n | r,s)P(g fit<T (x 1 , p) | n) J d n'P(n' | r,s)P{g J> ^{x 2 ,p) | ri). 
Expand the first integral in this product as 

J d r'[J dnd.s' p 6(g s ,Jx,r') -y)P(r',s' p | n)P(n | r,s)] 

(and similarly for the second) . 

Say that the first distribution in the integrand is peaked, in $s'_\rho$, about some
$h(n)$, and that the second one is peaked about the $n$ lying in the preimage
$h^{-1}(s)$. (This is exactly true if $n$ specifies $s$ precisely.) Then we can replace

J dnds p 6(g s , fi (x, r') - y)P(r', s' p j n)P(n | r,s) — 

J d nS(g s (x,r') - y)P(r\ s' | n)P(n | r,s) 

We would have arrived at the exact same expression if we had made the 
analogous approximations in expanding J dnP(n I r, s)P^ v ' a ^ {x J tz, s) instead. 
Hence these approximations justify the second premise. However the second 
premise can hold even if not all of those approximations of peaked distributions 
are valid, so long as there is sufficient cancellation among the contributions 
from the wings of the distributions (e.g., it will hold if v C a regardless of 
such peakedness). So the second premise is weaker than these approximations. 
In fact, under those approximations, we could always replace the ambiguities 
arising in H with their averages according to P(n | r, s), something which we 
do not do in the current analysis. 

G An alternative definition of ambiguity 

Note that rather than $P(y^1, y^2; \psi; l, x^1, x^2)$, the difference of the distributions
of utility values at $x^1$ and $x^2$, one could consider the distribution of differences,

$$P^*(y^1, y^2; \psi; l, x^1, x^2) = \int dr\, ds\, P(r, s \mid l)\, \delta(y^1 - \psi_s(x^1, r))\, \delta(y^2 - \psi_s(x^2, r)),$$

and the associated ambiguity $A^*$. Now almost all of the theorems and corollaries
presented below hold for ambiguities based on $A^*$ as well as $A$, so we could use
$A^*$ rather than $A$ if we wanted to. Moreover, $P$ is $P^*$ modified to preserve the
marginals of the random variables $y^1$ and $y^2$ while making those variables be
independent:

$$P(y^1, y^2; \psi; l, x^1, x^2) = P^*(y^1; \psi; l, x^1, x^2)\, P^*(y^2; \psi; l, x^1, x^2).$$

So $A^*$ fixes ($P^*$ which fixes $P$ which fixes) $A$, but not vice-versa, i.e., $A$ contains
less information than $A^*$. Furthermore, of all ambiguities based on a distribution
with the same marginals as $P^*$, $A$ is the “widest”, having the largest region in
which it is neither 0 nor 1.

However all of this does not mean that we are just being more conservative
by using A rather than A*, i.e., that we are discarding certain predictions concerning
orderings of CDF’s that we would make if we used A*, while keeping
other such predictions. That’s because in general A can shrink in going from 
one l to another (i.e., its value can decrease for at least one y and not increase 
for any y) while A* does not, and vice-versa. 68 So either choice of ambiguity 
may result in predictions that would not have been made with the other choice. 

In this paper we restrict attention to learning algorithms whose behavior 
depends on increasing/decreasing ambiguities based on A rather than on A*. 
This seems to be the case for most real-world learning algorithms, and therefore 
A rather than A * seems to be the appropriate quantity to plug into our results. 
Only if the learning algorithm exploits information in n about the relation of 
utility values at the same r would changes in A* be a better predictor of asso- 
ciated changes in what move the algorithm is likely to make. This is rarely the 
case though. For example, training sets formed in the course of multi-step games 
(see App. D) contain information about utility values for move/context pairs 
(one such pair for each preceding timestep), rather than for multiple moves in 
a particular context. 

Despite this though, since A* fixes A but not vice-versa, parameterizing H
in terms of A* rather than A would make H more flexible. However since the
premises only involve A, not A*, to simplify the exposition here we will write H
in terms of A.